Abstract

We study the dynamics of supervised learning in layered neural networks, in the regime where the 
size p of the training set is proportional to the number N of inputs. Here the local fields are no 
longer described by Gaussian probability distributions and the learning dynamics is of a spin-glass 
nature, with the composition of the training set playing the role of quenched disorder. We show how 
dynamical replica theory can be used to predict the evolution of macroscopic observables, including 
the two relevant performance measures (training error and generalization error), incorporating the
old formalism developed for complete training sets in the limit $\alpha = p/N \to \infty$ as a special case.
For simplicity we restrict ourselves in this paper to single-layer networks and realizable tasks. 



PACS: 87.10. -fe, 02.50.-r, 05.20.-y 
1 Introduction



In the last few years much progress has been made in the analysis of the dynamics of supervised 
learning in layered neural networks, using the strategy of statistical mechanics: by deriving from the 
microscopic dynamical equations of the learning process a set of closed laws describing the evolution 
of suitably chosen macroscopic observables (dynamic order parameters), in the limit of an infinite 
system size (e.g. |3|, ^, |5|. A recent review and more extensive guide to the relevant references 

can be found in |^, ^. also contains a preliminary presentation of some of the results in the present 
paper, without proofs or derivations. The main successful procedure developed so far is built on the 
following four cornerstones: 

• The task to be learned by the network is defined by a (possibly noisy) 'teacher', which is itself a 
layered neural network. This induces a canonical set of dynamical order parameters, typically 
the (rescaled) overlaps between the various student weight vectors and the corresponding teacher 
weight vectors. 

• The number of network inputs is (eventually) taken to be infinitely large. This ensures that 
fluctuations in mean-field observables will vanish, and creates the possibility of using the central 
limit theorem. 

• The number of 'hidden' neurons is finite. This prevents the number of order parameters from 
being infinite, and ensures that the cumulative impact of their fluctuations is insignificant. 

• The size of the training set is much larger than the number of weight updates made. Each 
example presented to the system is now different from those that have already been seen, such 
that the local fields will have Gaussian probability distributions, which leads to closure of the 
dynamic equations. 

These are not ingredients to simplify the calculations, but vital conditions, without which the standard 
method fails. Although the assumption of an infinite system size has been shown not to be too critical 
[§, the other assumptions do place serious restrictions on the degree of realism of the scenarios that 
can be analyzed, and have thereby, to some extent, prevented the theoretical results from being used 
by practitioners. 

Here we study the dynamics of learning in layered neural networks with restricted training sets,
where the number $p$ of examples ('questions' with corresponding 'answers') scales linearly with the
number $N$ of inputs, i.e. $p = \alpha N$ with $0 < \alpha < \infty$. Here individual questions will re-appear during
the learning process as soon as the number of weight updates made is of the order of the size of
the training set. In the traditional models, where the duration of an individual update is defined as
$N^{-1}$, this happens as soon as $t = O(\alpha)$. At that point correlations develop between the weights and
the questions in the training set, and the dynamics is of a spin-glass type, with the composition of
the training set playing the role of 'quenched disorder'. The main consequence of this is that the
central limit theorem no longer applies to the student's local fields, which are now described by
non-Gaussian distributions. To demonstrate this we trained (on-line) a perceptron with weights $J_i$ on
noiseless examples generated by a teacher perceptron with weights $B_i$, using the Hebb and AdaTron
rules. We plotted in Fig. 1 the student and teacher fields, $x = \boldsymbol{J}\cdot\boldsymbol{\xi}$ and $y = \boldsymbol{B}\cdot\boldsymbol{\xi}$ respectively, where
$\boldsymbol{\xi}$ is the input vector, for $p = N/2$ examples and at time $t = 50$. The marginal distribution $P(x)$
for $p = N/4$, at times $t = 10$ for the Hebb rule and $t = 20$ for the AdaTron rule, is shown in Fig. 2.
The non-Gaussian student field distributions observed in Figs. 1 and 2 induce a deviation between 
Figure 1: Student and teacher fields $(x, y) = (\boldsymbol{J}\cdot\boldsymbol{\xi}, \boldsymbol{B}\cdot\boldsymbol{\xi})$ as observed during numerical simulations of
on-line learning (learning rate $\eta = 1$) in a perceptron of size $N = 10{,}000$ at $t = 50$, using 'questions'
from a restricted training set of size $p = N/2$. Left: Hebbian learning. Right: AdaTron learning.
Note: in the case of Gaussian field distributions one would have found spherically shaped plots.



the training- and generalization errors, which measure the network performance on training and test 
examples, respectively. The former involves averages over the non-Gaussian field distribution, whereas 
the latter (which is calculated over all possible examples) still involves Gaussian fields. 
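The gap between the two errors can be reproduced in a small simulation. The sketch below is an illustration under stated assumptions, not the code behind Figs. 1 and 2: it assumes the Hebbian rule in the form $\mathcal{G}[x,y] = \mathrm{sgn}(y)$, a learning rate $\eta = 1$, a system far smaller ($N = 200$) than the $N = 10{,}000$ of the figures, and a hypothetical function name `simulate_hebb`. It trains on-line on a restricted set of $p = N/2$ questions and compares the error on that set with the error on fresh questions.

```python
import random


def sign(v):
    return 1.0 if v >= 0 else -1.0


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def simulate_hebb(N=200, alpha=0.5, t_max=10.0, eta=1.0, n_test=500, seed=0):
    """On-line Hebbian learning from a restricted training set (illustrative)."""
    rng = random.Random(seed)
    # Teacher weight vector, normalised so that B.B = 1.
    B = [rng.gauss(0.0, 1.0) for _ in range(N)]
    norm = dot(B, B) ** 0.5
    B = [b / norm for b in B]
    # Restricted training set: p = alpha * N random binary questions.
    p = int(alpha * N)
    train = [[rng.choice((-1.0, 1.0)) for _ in range(N)] for _ in range(p)]
    answers = [sign(dot(B, xi)) for xi in train]
    # One update corresponds to a time increment 1/N, so N * t_max updates.
    J = [0.0] * N
    for _ in range(int(N * t_max)):
        mu = rng.randrange(p)          # questions are re-drawn from the same set
        c = eta / N * answers[mu]      # Hebb rule (assumed form): G[x, y] = sgn(y)
        xi = train[mu]
        for i in range(N):
            J[i] += c * xi[i]
    # Student and teacher fields (x, y) on the training set.
    fields = [(dot(J, xi), dot(B, xi)) for xi in train]
    E_t = sum(1 for x, y in fields if x * y < 0) / p
    # Monte Carlo estimate of the generalization error on unseen questions.
    errors = 0
    for _ in range(n_test):
        xi = [rng.choice((-1.0, 1.0)) for _ in range(N)]
        if dot(J, xi) * dot(B, xi) < 0:
            errors += 1
    return fields, E_t, errors / n_test
```

Calling `simulate_hebb()` returns the list of $(x, y)$ pairs on the training set together with the two error estimates; a scatter plot of the pairs is the small-$N$ analogue of the left panel of Fig. 1.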

The appearance of non-Gaussian fields leads to a complete breakdown of the standard formalism, 
based on deriving closed equations for a finite number of observables: the field distributions can no 
longer be characterized by a few moments, and the macroscopic laws must now be averaged over 
realizations of the training set. One could still try to use Gaussian distributions as large-$\alpha$ approximations,
see e.g. [^], but it will be clear from Figs. 1 and 2 that a systematic theory will have to give up
Gaussian distributions entirely. The first rigorous study of the dynamics of learning with restricted 
training sets in non-linear networks, via the calculation of generating functionals, was carried out in 
[10] for perceptrons with binary weights. The only cases where explicit and relatively simple solutions
can be obtained, even for restricted training sets, are those where linear learning rules are used, such 
as or m. 

In this paper we show how the formalism of dynamical replica theory (see e.g. [13]) can be used
successfully to predict the evolution of macroscopic observables for finite $\alpha$, incorporating the infinite
training set formalism as a special case, for $\alpha \to \infty$. Central to our approach is the derivation of a
diffusion equation for the joint distribution $P[x, y]$ of the student and teacher fields, which will be found
to have Gaussian solutions only for $\alpha \to \infty$. For simplicity and transparency we restrict ourselves
in the present paper to single-layer systems and noise-free teachers. Application and generalization
of our methods to multi-layer systems [14] and learning scenarios involving 'noisy' teachers [15] are
presently under way.

Our paper is organized as follows. In section 2 we first derive a Fokker-Planck equation describing 
the evolution of arbitrary mean-field observables for $N \to \infty$. This allows us to identify the conditions
for the latter to be described by closed deterministic laws. We then choose as our observables the joint 
Figure 2: Distribution $P(x)$ of student fields as observed during numerical simulations of on-line
learning (learning rate $\eta = 1$) in a perceptron of size $N = 10{,}000$, using 'questions' from a restricted
training set of size $p = N/4$. Left: Hebbian learning, measured at $t = 10$. Right: AdaTron learning,
measured at $t = 20$. Note: not only are these distributions distinctively non-Gaussian, they also
appear to vary widely in their basic characteristics, depending on the learning rule used.



field distribution P[x, y], in addition to (the traditional ones) Q and R, and show that this set {Q, R, P} 
obeys deterministic laws. In order to close these laws we use the tools of dynamical replica theory. 
Details of the replica calculation are given in an Appendix, so that they can be skipped by those 
primarily interested in results. In section 3 we summarize the final replica-symmetric macroscopic 
theory and its notational conventions, discuss some of its general properties, and show how in the 
limit $\alpha \to \infty$ (infinite training sets) the equations of the conventional theory are recovered. In a
subsequent paper we will work out and apply our equations explicitly for several types of learning rules, 
and compare the predictions of our theory with exact results (derived directly from the microscopic 
equations, for Hebbian learning [^) and with numerical simulations. 

2 From Microscopic to Macroscopic Laws 
2.1 Definitions 

A student perceptron operates the following rule, which is parametrised by a weight vector $\boldsymbol{J} \in \mathbb{R}^N$:

$$S: \{-1,1\}^N \to \{-1,1\}, \qquad S(\boldsymbol{\xi}) = \mathrm{sgn}[\boldsymbol{J}\cdot\boldsymbol{\xi}] \tag{1}$$

It tries to emulate the operation of a teacher perceptron, which is assumed to operate a similar rule,
characterized by a given (fixed) weight vector $\boldsymbol{B} \in \mathbb{R}^N$:

$$T: \{-1,1\}^N \to \{-1,1\}, \qquad T(\boldsymbol{\xi}) = \mathrm{sgn}[\boldsymbol{B}\cdot\boldsymbol{\xi}] \tag{2}$$

In order to improve its performance, the student perceptron modifies its weight vector $\boldsymbol{J}$ according
to an iterative procedure, using examples of input vectors (or 'questions') $\boldsymbol{\xi}$, drawn at random from a
fixed training set $\tilde{D} \subseteq D = \{-1,1\}^N$, and the corresponding values of the teacher outputs $T(\boldsymbol{\xi})$.
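We will consider the case where the training set is a randomly composed subset $\tilde{D} \subset D$, of size
$|\tilde{D}| = p = \alpha N$ with $\alpha > 0$:

$$\tilde{D} = \{\boldsymbol{\xi}^1,\ldots,\boldsymbol{\xi}^p\}, \qquad p = \alpha N, \qquad \boldsymbol{\xi}^\mu \in D \ \text{ for all } \mu \tag{3}$$

We will denote averages over the training set $\tilde{D}$ and averages over the full question set $D$ in the
following way:

$$\langle \Phi(\boldsymbol{\xi})\rangle_{\tilde{D}} = \frac{1}{|\tilde{D}|}\sum_{\boldsymbol{\xi}\in\tilde{D}}\Phi(\boldsymbol{\xi}), \qquad \langle \Phi(\boldsymbol{\xi})\rangle_{D} = \frac{1}{|D|}\sum_{\boldsymbol{\xi}\in D}\Phi(\boldsymbol{\xi}).$$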

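We will analyze the following two classes of learning rules:

$$\begin{aligned}
\text{on-line:}\quad & \boldsymbol{J}(m+1) = \boldsymbol{J}(m) + \frac{\eta}{N}\,\boldsymbol{\xi}(m)\,\mathcal{G}[\boldsymbol{J}(m)\cdot\boldsymbol{\xi}(m), \boldsymbol{B}\cdot\boldsymbol{\xi}(m)] \\
\text{batch:}\quad & \boldsymbol{J}(m+1) = \boldsymbol{J}(m) + \frac{\eta}{N}\,\big\langle \boldsymbol{\xi}\,\mathcal{G}[\boldsymbol{J}(m)\cdot\boldsymbol{\xi}, \boldsymbol{B}\cdot\boldsymbol{\xi}]\big\rangle_{\tilde{D}}
\end{aligned} \tag{4}$$

In on-line learning one draws at each iteration step $m$ a question $\boldsymbol{\xi}(m) \in \tilde{D}$ at random; the dynamics
is thus a stochastic process. In batch learning one iterates a deterministic map. The function $\mathcal{G}[x,y]$
is assumed to be bounded and not to depend on $N$, other than via its two arguments.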
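The two update classes can be sketched in a few lines. The block below is a minimal illustration, assuming updates of the form $\boldsymbol{J} \to \boldsymbol{J} + (\eta/N)\,\boldsymbol{\xi}\,\mathcal{G}[x,y]$ with $x = \boldsymbol{J}\cdot\boldsymbol{\xi}$ and $y = \boldsymbol{B}\cdot\boldsymbol{\xi}$; the concrete forms chosen for the Hebb and AdaTron functions are common textbook choices and are assumptions here, as are all function names.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def hebb(x, y):
    # Assumed Hebbian rule: G[x, y] = sgn(y).
    return 1.0 if y >= 0 else -1.0


def adatron(x, y):
    # Assumed AdaTron rule: G[x, y] = |x| sgn(y) if student and teacher disagree, else 0.
    if x * y < 0:
        return abs(x) * (1.0 if y >= 0 else -1.0)
    return 0.0


def online_step(J, B, xi, G, eta):
    """One on-line update using a single question xi drawn from the training set."""
    N = len(J)
    g = G(dot(J, xi), dot(B, xi))
    return [J[i] + eta / N * xi[i] * g for i in range(N)]


def batch_step(J, B, train, G, eta):
    """One batch update: the shift is averaged over the whole training set."""
    N, p = len(J), len(train)
    avg = [0.0] * N
    for xi in train:
        g = G(dot(J, xi), dot(B, xi))
        for i in range(N):
            avg[i] += xi[i] * g / p
    return [J[i] + eta / N * avg[i] for i in range(N)]
```

For a training set containing a single question the two updates coincide, which is a convenient sanity check of an implementation.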

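Our most important observables during learning are the training error $E_t(\boldsymbol{J})$ and the generalization
error $E_g(\boldsymbol{J})$, defined as follows:

$$E_t(\boldsymbol{J}) = \big\langle\theta[-(\boldsymbol{J}\cdot\boldsymbol{\xi})(\boldsymbol{B}\cdot\boldsymbol{\xi})]\big\rangle_{\tilde{D}}, \qquad E_g(\boldsymbol{J}) = \big\langle\theta[-(\boldsymbol{J}\cdot\boldsymbol{\xi})(\boldsymbol{B}\cdot\boldsymbol{\xi})]\big\rangle_{D}. \tag{5}$$

Only if the training set $\tilde{D}$ is sufficiently large, and if there are no correlations between $\boldsymbol{J}$ and the
questions $\boldsymbol{\xi} \in \tilde{D}$, will these two errors be identical.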
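Both errors are simple to evaluate numerically. The sketch below computes the training error as the fraction of training questions on which student and teacher fields have opposite signs. For the generalization error it uses the standard large-$N$ perceptron result $E_g = \pi^{-1}\arccos(R/\sqrt{Q})$ with $Q = \boldsymbol{J}^2$ and $R = \boldsymbol{J}\cdot\boldsymbol{B}$, which assumes $\boldsymbol{B}^2 = 1$ and Gaussian fields on questions that are uncorrelated with $\boldsymbol{J}$; the function names are illustrative.

```python
import math


def training_error(fields):
    """Fraction of (x, y) field pairs on the training set with x * y < 0."""
    return sum(1 for x, y in fields if x * y < 0) / len(fields)


def generalization_error(Q, R):
    """Large-N generalization error for Gaussian fields, assuming B.B = 1."""
    ratio = R / math.sqrt(Q)
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding just outside [-1, 1]
    return math.acos(ratio) / math.pi
```

A student with no overlap with the teacher ($R = 0$) gives $E_g = 1/2$, and a perfectly aligned student ($R = \sqrt{Q}$) gives $E_g = 0$.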

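We next convert the dynamical laws (4) into the language of stochastic processes. We introduce
the probability $\hat{p}_m(\boldsymbol{J})$ to find weight vector $\boldsymbol{J}$ at discrete iteration step $m$. In terms of this microscopic
probability distribution the processes (4) can be written in the general Markovian form

$$\hat{p}_{m+1}(\boldsymbol{J}) = \int d\boldsymbol{J}'\; W[\boldsymbol{J};\boldsymbol{J}']\; \hat{p}_m(\boldsymbol{J}'), \tag{6}$$

with the transition probabilities

$$\begin{aligned}
\text{on-line:}\quad & W[\boldsymbol{J};\boldsymbol{J}'] = \Big\langle \delta\Big[\boldsymbol{J} - \boldsymbol{J}' - \frac{\eta}{N}\,\boldsymbol{\xi}\,\mathcal{G}[\boldsymbol{J}'\cdot\boldsymbol{\xi}, \boldsymbol{B}\cdot\boldsymbol{\xi}]\Big]\Big\rangle_{\tilde{D}} \\
\text{batch:}\quad & W[\boldsymbol{J};\boldsymbol{J}'] = \delta\Big[\boldsymbol{J} - \boldsymbol{J}' - \frac{\eta}{N}\,\big\langle\boldsymbol{\xi}\,\mathcal{G}[\boldsymbol{J}'\cdot\boldsymbol{\xi}, \boldsymbol{B}\cdot\boldsymbol{\xi}]\big\rangle_{\tilde{D}}\Big]
\end{aligned} \tag{7}$$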

We make the transition to a description involving real-valued time labels by choosing the duration 
of each iteration step to be a real-valued random number, such that the probability that at time t 
precisely m steps have been made is given by the Poisson expression 



$$\pi_m(t) = \frac{1}{m!}\,(Nt)^m\, e^{-Nt}. \tag{8}$$

For times $t \gg N^{-1}$ we find $t = m/N + O(N^{-1/2})$, the usual time unit. Due to the random durations
of the iteration steps we have to switch to the following microscopic probability distribution:

$$p_t(\boldsymbol{J}) = \sum_{m\geq 0} \pi_m(t)\; \hat{p}_m(\boldsymbol{J}). \tag{9}$$
This distribution obeys a simple differential equation, which immediately follows from the pleasant
properties of (8) under temporal differentiation:

$$\frac{d}{dt}\,p_t(\boldsymbol{J}) = N \int d\boldsymbol{J}'\; \big\{W[\boldsymbol{J};\boldsymbol{J}'] - \delta[\boldsymbol{J}-\boldsymbol{J}']\big\}\; p_t(\boldsymbol{J}'). \tag{10}$$

So far no approximations have been made; equation (10) is exact for any $N$. It is the equivalent of
the master equation often introduced to define the dynamics of spin systems.
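The continuous-time construction can be checked numerically. A Poisson-distributed number of steps at time $t$ is equivalent to drawing independent exponential step durations with mean $1/N$; the sketch below uses that equivalence to sample the step count $m$, and also evaluates the Poisson weight $\pi_m(t)$ directly (for $t > 0$). Function names are illustrative.

```python
import math
import random


def poisson_weight(m, N, t):
    """pi_m(t) = (N t)^m exp(-N t) / m!, evaluated in log space for stability (t > 0)."""
    lam = N * t
    return math.exp(m * math.log(lam) - lam - math.lgamma(m + 1))


def sample_steps(N, t, rng):
    """Number of iteration steps completed by time t when each step lasts an
    exponential random time with mean 1/N."""
    m = 0
    elapsed = rng.expovariate(N)
    while elapsed <= t:
        m += 1
        elapsed += rng.expovariate(N)
    return m
```

For large $N$ the ratio $m/N$ returned by `sample_steps` concentrates around $t$ with fluctuations of order $N^{-1/2}$, which is the sense in which $t = m/N$ remains the usual time unit.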



2.2 Derivation of Macroscopic Fokker-Planck Equation 

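We now wish to investigate the dynamics of a number of as yet arbitrary macroscopic observables
$\boldsymbol{\Omega}[\boldsymbol{J}] = (\Omega_1[\boldsymbol{J}], \ldots, \Omega_k[\boldsymbol{J}])$. To do so we introduce a macroscopic probability distribution

$$P_t(\boldsymbol{\Omega}) = \int d\boldsymbol{J}\; p_t(\boldsymbol{J})\; \delta\big[\boldsymbol{\Omega} - \boldsymbol{\Omega}[\boldsymbol{J}]\big]. \tag{11}$$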



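Its time derivative immediately follows from that in (10):

$$\begin{aligned}
\frac{d}{dt}P_t(\boldsymbol{\Omega}) &= N\int d\boldsymbol{J}\,d\boldsymbol{J}'\; \delta\big[\boldsymbol{\Omega}-\boldsymbol{\Omega}[\boldsymbol{J}]\big]\,\big\{W[\boldsymbol{J};\boldsymbol{J}']-\delta[\boldsymbol{J}-\boldsymbol{J}']\big\}\, p_t(\boldsymbol{J}') \\
&= N\int d\boldsymbol{\Omega}' \int d\boldsymbol{J}\,d\boldsymbol{J}'\; \delta\big[\boldsymbol{\Omega}-\boldsymbol{\Omega}[\boldsymbol{J}]\big]\,\delta\big[\boldsymbol{\Omega}'-\boldsymbol{\Omega}[\boldsymbol{J}']\big]\,\big\{W[\boldsymbol{J};\boldsymbol{J}']-\delta[\boldsymbol{J}-\boldsymbol{J}']\big\}\, p_t(\boldsymbol{J}')
\end{aligned}$$

This then can be written in the standard form

$$\frac{d}{dt}P_t(\boldsymbol{\Omega}) = \int d\boldsymbol{\Omega}'\; \mathcal{W}_t[\boldsymbol{\Omega};\boldsymbol{\Omega}']\; P_t(\boldsymbol{\Omega}'),$$

where

$$\mathcal{W}_t[\boldsymbol{\Omega};\boldsymbol{\Omega}'] = \frac{\int d\boldsymbol{J}'\; p_t(\boldsymbol{J}')\,\delta\big[\boldsymbol{\Omega}'-\boldsymbol{\Omega}[\boldsymbol{J}']\big] \int d\boldsymbol{J}\; \delta\big[\boldsymbol{\Omega}-\boldsymbol{\Omega}[\boldsymbol{J}]\big]\, N\big\{W[\boldsymbol{J};\boldsymbol{J}']-\delta[\boldsymbol{J}-\boldsymbol{J}']\big\}}{\int d\boldsymbol{J}'\; p_t(\boldsymbol{J}')\,\delta\big[\boldsymbol{\Omega}'-\boldsymbol{\Omega}[\boldsymbol{J}']\big]}. \tag{12}$$

If we now insert the relevant expressions (7) for $W[\boldsymbol{J};\boldsymbol{J}']$ we can perform the $\boldsymbol{J}$-integrations, and
obtain results given in terms of so-called sub-shell averages, which are defined as

$$\langle f(\boldsymbol{J})\rangle_{\boldsymbol{\Omega};t} = \frac{\int d\boldsymbol{J}\; p_t(\boldsymbol{J})\,\delta\big[\boldsymbol{\Omega}-\boldsymbol{\Omega}[\boldsymbol{J}]\big]\, f(\boldsymbol{J})}{\int d\boldsymbol{J}\; p_t(\boldsymbol{J})\,\delta\big[\boldsymbol{\Omega}-\boldsymbol{\Omega}[\boldsymbol{J}]\big]}.$$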

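For the two classes of learning rules at hand we obtain:

$$\mathcal{W}_t^{\rm onl}[\boldsymbol{\Omega};\boldsymbol{\Omega}'] = N\Big\langle \Big\langle \delta\Big[\boldsymbol{\Omega}-\boldsymbol{\Omega}\big[\boldsymbol{J}+\tfrac{\eta}{N}\boldsymbol{\xi}\,\mathcal{G}[\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]\big]\Big]\Big\rangle_{\tilde{D}} - \delta\big[\boldsymbol{\Omega}-\boldsymbol{\Omega}[\boldsymbol{J}]\big]\Big\rangle_{\boldsymbol{\Omega}';t}$$

$$\mathcal{W}_t^{\rm bat}[\boldsymbol{\Omega};\boldsymbol{\Omega}'] = N\Big\langle \delta\Big[\boldsymbol{\Omega}-\boldsymbol{\Omega}\big[\boldsymbol{J}+\tfrac{\eta}{N}\langle\boldsymbol{\xi}\,\mathcal{G}[\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]\rangle_{\tilde{D}}\big]\Big] - \delta\big[\boldsymbol{\Omega}-\boldsymbol{\Omega}[\boldsymbol{J}]\big]\Big\rangle_{\boldsymbol{\Omega}';t}$$

We now insert integral representations for the $\delta$-distributions. The observables $\boldsymbol{\Omega}[\boldsymbol{J}] \in \mathbb{R}^k$ are assumed
to be $O(1)$ each, and finite in number (i.e. $k \ll N$):

$$\delta[\boldsymbol{\Omega}-\boldsymbol{Q}] = \int\!\frac{d\hat{\boldsymbol{\Omega}}}{(2\pi)^k}\; e^{i\hat{\boldsymbol{\Omega}}\cdot[\boldsymbol{\Omega}-\boldsymbol{Q}]} \tag{13}$$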
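which gives for our two learning scenarios:

$$\mathcal{W}_t^{\rm onl}[\boldsymbol{\Omega};\boldsymbol{\Omega}'] = \int\!\frac{d\hat{\boldsymbol{\Omega}}}{(2\pi)^k}\; e^{i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}}\; N\Big\langle \Big\langle e^{-i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}+\frac{\eta}{N}\boldsymbol{\xi}\mathcal{G}[\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]]}\Big\rangle_{\tilde{D}} - e^{-i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}]}\Big\rangle_{\boldsymbol{\Omega}';t} \tag{14}$$

$$\mathcal{W}_t^{\rm bat}[\boldsymbol{\Omega};\boldsymbol{\Omega}'] = \int\!\frac{d\hat{\boldsymbol{\Omega}}}{(2\pi)^k}\; e^{i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}}\; N\Big\langle e^{-i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}+\frac{\eta}{N}\langle\boldsymbol{\xi}\mathcal{G}[\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]\rangle_{\tilde{D}}]} - e^{-i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}]}\Big\rangle_{\boldsymbol{\Omega}';t} \tag{15}$$

Still no approximations have been made. The above two expressions differ only in the stage at which the
averaging over the training set occurs.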

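In expanding equations (14,15) for large $N$ and finite $t$ we have to be careful, since the system
size enters in two ways: as a small parameter controlling the magnitude of the modification of individual
components of the weight vector, and as the quantity that determines the dimensions and lengths of the
various vectors that occur. We therefore inspect more closely the usual Taylor expansions:

$$F[\boldsymbol{J}+\boldsymbol{k}]-F[\boldsymbol{J}] = \sum_{\ell\geq 1}\frac{1}{\ell!}\sum_{i_1=1}^{N}\cdots\sum_{i_\ell=1}^{N} k_{i_1}\cdots k_{i_\ell}\; \frac{\partial^\ell F[\boldsymbol{J}]}{\partial J_{i_1}\cdots\partial J_{i_\ell}}$$

If we assess how derivatives with respect to individual components $J_i$ scale for mean-field observables
such as $Q[\boldsymbol{J}] = \boldsymbol{J}^2$ and $R[\boldsymbol{J}] = \boldsymbol{B}\cdot\boldsymbol{J}$, we find the following scaling property which we will choose as
our definition of simple mean-field observables:

$$F[\boldsymbol{J}] = O(N^0), \qquad \frac{\partial^\ell F[\boldsymbol{J}]}{\partial J_{i_1}\cdots\partial J_{i_\ell}} = O\big(|\boldsymbol{J}|^{-\ell}\, N^{\frac{1}{2}\ell-d}\big) \qquad (N\to\infty) \tag{16}$$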



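in which $d$ is the number of different elements in the set $\{i_1, \ldots, i_\ell\}$. For simple mean-field observables
we can now estimate the scaling of the various terms in the Taylor expansion. However, we will find
that for restricted training sets not all relevant observables will have the properties (16). In particular,
the joint distribution of student and teacher fields will, for on-line learning, have a contribution for
which all terms in the Taylor series will have to be summed, giving rise to an additional term $\Delta[\boldsymbol{J};\boldsymbol{k}]$.¹
The latter type of more general mean-field observables will have to be defined via the identities

$$F[\boldsymbol{J}+\boldsymbol{k}]-F[\boldsymbol{J}] = \Delta[\boldsymbol{J};\boldsymbol{k}] + \sum_i k_i\frac{\partial F[\boldsymbol{J}]}{\partial J_i} + \frac{1}{2}\sum_{ij}k_ik_j\frac{\partial^2 F[\boldsymbol{J}]}{\partial J_i\partial J_j} + \sum_{\ell\geq 3} O\Big(\frac{|\boldsymbol{k}|^\ell}{|\boldsymbol{J}|^\ell}\Big) \tag{17}$$

$$F[\boldsymbol{J}] = O(N^0), \qquad \Delta[\boldsymbol{J};\boldsymbol{k}] = O\big(|\boldsymbol{k}|^2/|\boldsymbol{J}|^2\big) \tag{18}$$

(in the assessment of the order of the remainder terms of (17) we have used $\sum_i k_i = O(\sqrt{N}|\boldsymbol{k}|)$).
Simple mean-field observables correspond to $\Delta[\boldsymbol{J};\boldsymbol{k}] = 0$.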
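For the scalar observable $Q[\boldsymbol{J}] = \boldsymbol{J}^2$ the absence of a $\Delta$ term can be checked directly, because all derivatives beyond second order vanish and the second-order Taylor expansion is exact. The sketch below, with illustrative names and an arbitrary value standing in for $\mathcal{G}$, also confirms that for an on-line shift $\boldsymbol{k} = (\eta/N)\,\boldsymbol{\xi}\,\mathcal{G}$ the rescaled change $N(Q[\boldsymbol{J}+\boldsymbol{k}] - Q[\boldsymbol{J}])$ equals $2\eta x\mathcal{G} + \eta^2\mathcal{G}^2$ with $x = \boldsymbol{J}\cdot\boldsymbol{\xi}$.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def q_change_exact(J, k):
    """Q[J + k] - Q[J] for Q[J] = J.J, computed directly."""
    shifted = [j + s for j, s in zip(J, k)]
    return dot(shifted, shifted) - dot(J, J)


def q_change_taylor(J, k):
    """First- plus second-order Taylor terms: sum_i k_i 2 J_i + (1/2) sum_ij k_i k_j 2 delta_ij."""
    return 2.0 * dot(k, J) + dot(k, k)


def online_shift(xi, g, eta):
    """Shift k = (eta / N) * xi * g for a single question xi (g stands in for G[x, y])."""
    N = len(xi)
    return [eta / N * g * x for x in xi]
```

Here the difference between `q_change_exact` and `q_change_taylor` is zero up to rounding, so $Q$ is a simple mean-field observable in the sense of (16).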

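We expand our macroscopic equations (14,15) for large $N$ and finite times, restricting ourselves
from now on to mean-field observables in the sense of (17,18). One of our observables we choose to be
$\boldsymbol{J}^2$. In the present problem the shifts $\boldsymbol{k}$, being either $\frac{\eta}{N}\boldsymbol{\xi}\,\mathcal{G}[\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]$ or $\frac{\eta}{N}\langle\boldsymbol{\xi}\,\mathcal{G}[\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]\rangle_{\tilde{D}}$, scale
as $|\boldsymbol{k}| = O(N^{-1/2})$. Consequently:

$$e^{-i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}+\boldsymbol{k}]} = e^{-i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}]}\Big\{1 - i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Delta}[\boldsymbol{J};\boldsymbol{k}] - i\sum_i k_i\frac{\partial}{\partial J_i}\big(\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}]\big) - \frac{i}{2}\sum_{ij}k_ik_j\frac{\partial^2}{\partial J_i\partial J_j}\big(\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}]\big) - \frac{1}{2}\Big[\sum_i k_i\frac{\partial}{\partial J_i}\big(\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}]\big)\Big]^2 + O(N^{-3/2})\Big\}$$

This, in turn, gives

$$\begin{aligned}
\int\!\frac{d\hat{\boldsymbol{\Omega}}}{(2\pi)^k}\, e^{i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}}\Big[e^{-i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}+\boldsymbol{k}]} - e^{-i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}]}\Big]
= \Big\{ & -\sum_\mu\Big[\Delta_\mu[\boldsymbol{J};\boldsymbol{k}] + \sum_i k_i\frac{\partial\Omega_\mu[\boldsymbol{J}]}{\partial J_i} + \frac{1}{2}\sum_{ij}k_ik_j\frac{\partial^2\Omega_\mu[\boldsymbol{J}]}{\partial J_i\partial J_j}\Big]\frac{\partial}{\partial\Omega_\mu} \\
& + \frac{1}{2}\sum_{\mu\nu}\sum_{ij}k_ik_j\frac{\partial\Omega_\mu[\boldsymbol{J}]}{\partial J_i}\frac{\partial\Omega_\nu[\boldsymbol{J}]}{\partial J_j}\frac{\partial^2}{\partial\Omega_\mu\partial\Omega_\nu}\Big\}\,\delta\big[\boldsymbol{\Omega}-\boldsymbol{\Omega}[\boldsymbol{J}]\big] + O(N^{-3/2})
\end{aligned}$$

¹ We are grateful to Dr. Yuan-sheng Xiong for alerting us to this important point.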

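It is now evident, in view of (14,15), that both types of dynamics are described by macroscopic laws
with transition probability densities of the general form

$$\mathcal{W}_t^{\star\star\star}[\boldsymbol{\Omega};\boldsymbol{\Omega}'] = \Big\{-\sum_\mu F_\mu^{\star\star\star}[\boldsymbol{\Omega}';t]\frac{\partial}{\partial\Omega_\mu} + \frac{1}{2}\sum_{\mu\nu}G_{\mu\nu}^{\star\star\star}[\boldsymbol{\Omega}';t]\frac{\partial^2}{\partial\Omega_\mu\partial\Omega_\nu}\Big\}\,\delta[\boldsymbol{\Omega}-\boldsymbol{\Omega}'] + O(N^{-1/2})$$

which, due to (12) and for $N \to \infty$ and finite times, leads to a Fokker-Planck equation:

$$\frac{d}{dt}P_t(\boldsymbol{\Omega}) = -\sum_{\mu=1}^{k}\frac{\partial}{\partial\Omega_\mu}\Big\{F_\mu[\boldsymbol{\Omega};t]\,P_t(\boldsymbol{\Omega})\Big\} + \frac{1}{2}\sum_{\mu\nu=1}^{k}\frac{\partial^2}{\partial\Omega_\mu\partial\Omega_\nu}\Big\{G_{\mu\nu}[\boldsymbol{\Omega};t]\,P_t(\boldsymbol{\Omega})\Big\} \tag{19}$$

The differences between the two types of dynamics are in the explicit expressions for the flow and
diffusion terms (with the short-hand $\mathcal{G} = \mathcal{G}[\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]$):

$$F_\mu^{\rm onl}[\boldsymbol{\Omega};t] = \lim_{N\to\infty}\Big\langle N\big\langle\Delta_\mu[\boldsymbol{J};\tfrac{\eta}{N}\boldsymbol{\xi}\mathcal{G}]\big\rangle_{\tilde{D}} + \eta\sum_i\langle\xi_i\mathcal{G}\rangle_{\tilde{D}}\frac{\partial\Omega_\mu[\boldsymbol{J}]}{\partial J_i} + \frac{\eta^2}{2N}\sum_{ij}\langle\xi_i\xi_j\mathcal{G}^2\rangle_{\tilde{D}}\frac{\partial^2\Omega_\mu[\boldsymbol{J}]}{\partial J_i\partial J_j}\Big\rangle_{\boldsymbol{\Omega};t}$$

$$G_{\mu\nu}^{\rm onl}[\boldsymbol{\Omega};t] = \lim_{N\to\infty}\frac{\eta^2}{N}\Big\langle\sum_{ij}\langle\xi_i\xi_j\mathcal{G}^2\rangle_{\tilde{D}}\frac{\partial\Omega_\mu[\boldsymbol{J}]}{\partial J_i}\frac{\partial\Omega_\nu[\boldsymbol{J}]}{\partial J_j}\Big\rangle_{\boldsymbol{\Omega};t}$$

$$F_\mu^{\rm bat}[\boldsymbol{\Omega};t] = \lim_{N\to\infty}\Big\langle N\Delta_\mu[\boldsymbol{J};\tfrac{\eta}{N}\langle\boldsymbol{\xi}\mathcal{G}\rangle_{\tilde{D}}] + \eta\sum_i\langle\xi_i\mathcal{G}\rangle_{\tilde{D}}\frac{\partial\Omega_\mu[\boldsymbol{J}]}{\partial J_i} + \frac{\eta^2}{2N}\sum_{ij}\langle\xi_i\mathcal{G}\rangle_{\tilde{D}}\langle\xi_j\mathcal{G}\rangle_{\tilde{D}}\frac{\partial^2\Omega_\mu[\boldsymbol{J}]}{\partial J_i\partial J_j}\Big\rangle_{\boldsymbol{\Omega};t}$$

$$G_{\mu\nu}^{\rm bat}[\boldsymbol{\Omega};t] = \lim_{N\to\infty}\frac{\eta^2}{N}\Big\langle\sum_{ij}\langle\xi_i\mathcal{G}\rangle_{\tilde{D}}\langle\xi_j\mathcal{G}\rangle_{\tilde{D}}\frac{\partial\Omega_\mu[\boldsymbol{J}]}{\partial J_i}\frac{\partial\Omega_\nu[\boldsymbol{J}]}{\partial J_j}\Big\rangle_{\boldsymbol{\Omega};t}$$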

Equation (19) allows us to define the goal of our exercise in more explicit form. If we wish to arrive
at closed deterministic macroscopic equations, we have to choose our observables such that

1. $\lim_{N\to\infty} G_{\mu\nu}[\boldsymbol{\Omega};t] = 0$ (this ensures determinism)

2. $\lim_{N\to\infty} \frac{\partial}{\partial t}F_\mu[\boldsymbol{\Omega};t] = 0$ (this ensures closure)

In the case of having time-dependent global parameters, such as learning rates or decay rates, the
latter condition relaxes to the requirement that any explicit time-dependence of $F_\mu[\boldsymbol{\Omega};t]$ is restricted
to these global parameters.



2.3 Choice and Properties of Canonical Observables 



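We next apply the general results obtained so far to a specific set of observables, $\{Q,R,P\}$,
which are tailored to the problem at hand (note that we restrict ourselves to $\boldsymbol{J}^2 = O(1)$ and $\boldsymbol{B}^2 = 1$):

$$Q[\boldsymbol{J}] = \boldsymbol{J}^2, \qquad R[\boldsymbol{J}] = \boldsymbol{J}\cdot\boldsymbol{B}, \qquad P[x,y;\boldsymbol{J}] = \big\langle\delta[x-\boldsymbol{J}\cdot\boldsymbol{\xi}]\;\delta[y-\boldsymbol{B}\cdot\boldsymbol{\xi}]\big\rangle_{\tilde{D}} \tag{20}$$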



with $x,y \in \mathbb{R}$. This choice is motivated by the following considerations: (i) in order to incorporate the
standard theory in the limit $\alpha \to \infty$ we need at least $Q[\boldsymbol{J}]$ and $R[\boldsymbol{J}]$, (ii) we need to be able to calculate
the training error, which involves field statistics calculated over the training set $\tilde{D}$, as described by
$P[x,y;\boldsymbol{J}]$, and (iii) for finite $\alpha$ one cannot expect closed macroscopic equations for just a finite number
of order parameters; the present choice (involving the order parameter function $P[x,y;\boldsymbol{J}]$) represents
effectively an infinite number.² In subsequent calculations we will, however, assume the number of
arguments $(x,y)$ for which $P[x,y;\boldsymbol{J}]$ is to be evaluated (and thus our number of order parameters)
to go to infinity only after the limit $N \to \infty$ has been taken. This will eliminate many technical
subtleties and will allow us to use the Fokker-Planck equation (19).
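In a simulation the order parameter function of (20) is only accessible on a finite grid of arguments $(x,y)$, in the spirit of taking the number of arguments to infinity after $N \to \infty$. The sketch below estimates $P[x,y;\boldsymbol{J}]$ as a normalised two-dimensional histogram of the fields over the training set; the grid range, bin count and function name are illustrative assumptions.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def joint_field_density(J, B, train, lo=-4.0, hi=4.0, bins=8):
    """Histogram estimate of P[x, y; J] on a bins x bins grid over [lo, hi)^2.

    Entry [ix][iy] is the fraction of training questions whose fields
    (x, y) = (J.xi, B.xi) fall in that cell, divided by the cell area.
    Questions with fields outside the grid are ignored.
    """
    width = (hi - lo) / bins
    counts = [[0] * bins for _ in range(bins)]
    for xi in train:
        ix = int((dot(J, xi) - lo) // width)
        iy = int((dot(B, xi) - lo) // width)
        if 0 <= ix < bins and 0 <= iy < bins:
            counts[ix][iy] += 1
    p = len(train)
    return [[c / (p * width * width) for c in row] for row in counts]
```

Summing the returned density times the cell area gives the fraction of training questions inside the grid, which equals one when the grid covers all observed fields.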



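The observables (20) are indeed of the general mean-field type in the sense of (17,18). Insertion
into the stronger condition (16) immediately shows this to be true for the scalar observables $Q[\boldsymbol{J}]$
and $R[\boldsymbol{J}]$ (they are simple mean field observables, for which the term $\Delta[\boldsymbol{J};\boldsymbol{k}]$ of (18) is absent). Verification of
(17,18) for the function $P[x,y;\boldsymbol{J}]$ is less trivial. We denote with $I$ the set of all different indices in
the list $(i_1,\ldots,i_\ell)$, with $n_k$ giving the number of times a number $k$ occurs, and with $I^\pm \subseteq I$ defined
as the set of all indices $k \in I$ for which $n_k$ is even ($+$), or odd ($-$). Note that with these definitions
$\ell = \sum_{k\in I^+} n_k + \sum_{k\in I^-} n_k \geq 2|I^+| + |I^-|$. We then have:

$$\frac{\partial^\ell P[x,y;\boldsymbol{J}]}{\partial J_{i_1}\cdots\partial J_{i_\ell}} = (-1)^\ell\frac{\partial^\ell}{\partial x^\ell}\int\!\frac{d\hat{x}\,d\hat{y}}{(2\pi)^2}\; e^{i[x\hat{x}+y\hat{y}]}\Big\langle\Big[\prod_{k\in I^-}\xi_k\Big]\, e^{-i\boldsymbol{\xi}\cdot[\hat{x}\boldsymbol{J}+\hat{y}\boldsymbol{B}]}\Big\rangle_{\tilde{D}}$$

Upon writing averaging over all training sets of size $p = \alpha N$ (where each realization of $\tilde{D}$ has equal
probability) as $\langle\ldots\rangle_{\rm sets}$, this allows us to conclude

$$\Big\langle\frac{\partial^\ell P[x,y;\boldsymbol{J}]}{\partial J_{i_1}\cdots\partial J_{i_\ell}}\Big\rangle_{\rm sets} = O\big(N^{-\frac{1}{2}|I^-|}\big)$$

Since $\frac{1}{2}\ell - |I| + \frac{1}{2}|I^-| = \frac{1}{2}\big[\ell - |I^-| - 2|I^+|\big] \geq 0$, the average over all training sets of the function $P[x,y;\boldsymbol{J}]$
is found to be a simple mean-field observable in the sense of (16).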

² A simple rule of thumb is the following: if a process requires replica theory for its stationary state analysis, as does
learning with restricted training sets, its dynamics is of a spin-glass type and cannot be described by a finite set of closed
dynamic equations.

The scaling properties of expansions or derivations of $P[x,y;\boldsymbol{J}]$ for a given training set $\tilde{D}$, however,
need not be identical to those of its average over all training sets $\langle P[x,y;\boldsymbol{J}]\rangle_{\rm sets}$. Here we have to use
the fact that $\tilde{D}$ has been composed in a random manner, as well as the specific form of the shifts $\boldsymbol{k}$ in
$P[x,y;\boldsymbol{J}+\boldsymbol{k}]$ that occur for the two types of dynamics under consideration:



P[x,y;J + k]- P[x,y;J] 



i[xx+yy] 

(2vr)2 



P 



ik-$' 



1 



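All complications are caused by the dependence of $\boldsymbol{k}$ on the composition of the training set $\tilde{D}$, and
would therefore have been absent in the $\alpha \to \infty$ case. This dependence will turn out to be harmless
in the case of batch learning, where $\boldsymbol{k} = \frac{\eta}{N}\langle\boldsymbol{\xi}\,\mathcal{G}[\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]\rangle_{\tilde{D}}$ is an average over $\tilde{D}$, but will have a
considerable impact in the case of on-line learning, where $\boldsymbol{k} = \frac{\eta}{N}\boldsymbol{\xi}\,\mathcal{G}[\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]$ is proportional to an
individual member of $\tilde{D}$. Working out the relevant expression for on-line learning gives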



P[x,y;J + k°'^']-P[x,y;J] 



dx dy 



P 



dxdy 



^i[xx+yy]^-ixJ-^~iyB-^ 



^~ir,xglJ-$,B-^] _ ^ 



1 



+ ir]xg[J-^,B-^] + -7]'x'g'[J-$,B-^] 



onl ;^onl 

^' dJidJ. 



P[x,y;J]+0{N-l) 



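We conclude that, at least for the purpose of the expansions relevant to on-line learning, $P[x,y;\boldsymbol{J}]$ is
a mean field observable in the sense of (17,18), with the non-trivial contribution of (18) given by

$$\begin{aligned}
\Delta[\boldsymbol{J};\boldsymbol{k}^{\rm onl}] = \frac{1}{p}\Big\{ & \delta\big[x-\boldsymbol{J}\cdot\boldsymbol{\xi}-\eta\,\mathcal{G}[\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]\big]\,\delta[y-\boldsymbol{B}\cdot\boldsymbol{\xi}] - \delta[x-\boldsymbol{J}\cdot\boldsymbol{\xi}]\,\delta[y-\boldsymbol{B}\cdot\boldsymbol{\xi}] \\
& + \eta\frac{\partial}{\partial x}\Big[\mathcal{G}[x,y]\,\delta[x-\boldsymbol{J}\cdot\boldsymbol{\xi}]\,\delta[y-\boldsymbol{B}\cdot\boldsymbol{\xi}]\Big] - \frac{1}{2}\eta^2\frac{\partial^2}{\partial x^2}\Big[\mathcal{G}^2[x,y]\,\delta[x-\boldsymbol{J}\cdot\boldsymbol{\xi}]\,\delta[y-\boldsymbol{B}\cdot\boldsymbol{\xi}]\Big]\Big\}
\end{aligned} \tag{21}$$

Note that $\lim_{N\to\infty} N\Delta[\boldsymbol{J};\boldsymbol{k}^{\rm onl}] = O(\eta^3/\alpha)$, so that for small learning rates or large training sets this
non-trivial term will vanish. Working out the relevant expression for batch learning, on the other
hand, gives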

p 



P[x,y-J+k''^']-P[x,y-J] 



dx dy 



,i[xx+yy\ 



E' 



iJ-t-iyB-^" 



'-^g[j.^t^,B-e] + 0{N-l' 

P 



1 



92 3 

P[x,y;J]+0{N-2] 



Here the term $\Delta[\boldsymbol{J};\boldsymbol{k}^{\rm bat}]$ is absent. In fact also the quadratic contribution $\sum_{ij}k_i^{\rm bat}k_j^{\rm bat}\ldots$ in the above
expansion will turn out to be of insignificant order in $N$. For the purpose of the expansions relevant
to batch learning, $P[x,y;\boldsymbol{J}]$ is apparently a simple mean field observable in the sense of (16). This
could have been anticipated, since one should ultimately obtain the batch learning equations upon
expanding those of on-line learning for small learning rate $\eta$, and retaining only the leading order $\eta^1$
in this expansion.
2.4 Derivation of Deterministic Dynamical Laws

Having defined our order parameters $Q$, $R$ and $\{P[x,y]\}$, from this stage onwards the notation $\langle\cdots\rangle_{QRP;t}$
will be used to denote sub-shell averages defined with respect to these order parameters, at time $t$.
With a modest amount of foresight we define the complementary Kronecker delta $\bar{\delta}_{ab} = 1 - \delta_{ab}$, and
the following key functions:



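$$\mathcal{A}[x,y;x',y'] = \lim_{N\to\infty}\Big\langle\Big\langle\Big\langle \bar{\delta}_{\boldsymbol{\xi}\boldsymbol{\xi}'}\,(\boldsymbol{\xi}\cdot\boldsymbol{\xi}')\;\delta[x-\boldsymbol{J}\cdot\boldsymbol{\xi}]\,\delta[y-\boldsymbol{B}\cdot\boldsymbol{\xi}]\,\delta[x'-\boldsymbol{J}\cdot\boldsymbol{\xi}']\,\delta[y'-\boldsymbol{B}\cdot\boldsymbol{\xi}']\Big\rangle_{\tilde{D}}\Big\rangle_{\tilde{D}}\Big\rangle_{QRP;t} \tag{22}$$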



B[x,y;x',y'] = lim S,,^{^,^j^',QS[x-J-my-B-mx'-J.my'-B-a )^)^ ) 

(23) 

C[x,y\x' ,y']x" ,y"] = lim 



(((^^^-'^^'^''^^^^^^^^^^^ )D)b)b 

(24) 



qpP;t 



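We will eventually show in a subsequent section that (23) and (24) are zero. The function (22), on
the other hand, will contain all the interesting physics of the learning process, and its calculation will
turn out to be our central problem.

We next show that for the observables (20) the diffusion matrix elements $G_{\mu\nu}^{\star\star\star}$ in the Fokker-Planck
equation (19) vanish for $N \to \infty$. Our observables will consequently obey deterministic dynamical
laws. Calculating diffusion terms associated with $Q[\boldsymbol{J}]$ and $R[\boldsymbol{J}]$ is trivial:

$$\begin{pmatrix} G_{QQ}^{\rm onl}[\ldots] \\ G_{QR}^{\rm onl}[\ldots] \\ G_{RR}^{\rm onl}[\ldots] \end{pmatrix} = \lim_{N\to\infty}\frac{\eta^2}{N}\int\! dx\,dy\; P[x,y]\;\mathcal{G}^2[x,y]\begin{pmatrix} 4x^2 \\ 2xy \\ y^2 \end{pmatrix} = 0$$

$$\begin{pmatrix} G_{QQ}^{\rm bat}[\ldots] \\ G_{QR}^{\rm bat}[\ldots] \\ G_{RR}^{\rm bat}[\ldots] \end{pmatrix} = \lim_{N\to\infty}\frac{\eta^2}{N}\begin{pmatrix} 4\big[\int\! dx\,dy\; P[x,y]\, x\,\mathcal{G}[x,y]\big]^2 \\ 2\big[\int\! dx\,dy\; P[x,y]\, x\,\mathcal{G}[x,y]\big]\big[\int\! dx\,dy\; P[x,y]\, y\,\mathcal{G}[x,y]\big] \\ \big[\int\! dx\,dy\; P[x,y]\, y\,\mathcal{G}[x,y]\big]^2 \end{pmatrix} = 0$$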



We next turn to diffusion terms with one occurrence of $P[x,y;\boldsymbol{J}]$. Here we repeatedly build on the
cornerstone assumption that all fields $\boldsymbol{J}\cdot\boldsymbol{\xi}$ and $\boldsymbol{B}\cdot\boldsymbol{\xi}$ are of order unity (which is clear from numerical
simulations, and will be supported self-consistently by the equations resulting from our theory), in
combination with two simple scaling consequences of the random composition of $\tilde{D}$, as $N \to \infty$:
For on-line learning we find:



E E [1-%] i^-e 



0{N 



(25) 



G\ 



onl 



2J-^ 



{^■mx-j-my-B-i'] 



ID ID 
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N 



2 J I 



{i<'W-J-i'W-B-i'] )f,h 



2x 

y 



cpp;t 



-rf^ lim l0{N-h + 0(N-^)) =0 



For batch learning we find: 



/-^bat r 1 

/-^bat r 1 
^R,P[x,y\V--\ 



N-*oo 



2x' 



X ((( g[j.^,B-^]{^.^')S[x-j-my-B-^'] 



CpP;t 



-if^ ldx'dy'P[x',y']g[x',y' 



2x' 



Jirn^^ g\x,y\{{ ?,^^8\x-J-i\S\y-B.i\ )^)^ 



+ ^(( \\-S^^\Q\3-i.B.m-^)8\x-J-my-B-^\ )3)d 



CpP]t 



= -r?2— lim (o(N-^)+0(N-h) =0 

The difficult terms are those where two derivatives of the order parameter function $P[x,y;\boldsymbol{J}]$ come
into play. Here we have to deal separately with four distinct contributions, defined according to which
of the vectors from the trio $\{\boldsymbol{\xi},\boldsymbol{\xi}',\boldsymbol{\xi}''\}$ are identical. For on-line learning we find:



/ qpp;t 

''^^d^'N^ooi NQ^^^yW -A^y -y\ Hi s^^"S^'^"S[x-J-my-B-^] )^)^)^ 

+ g^[x',y'] {{{6^^.S^,^.i^-^')S[x-J-my-B-m^'-J<W-B-^'] )f))D)D 

+ Q\x.y] {{{ s^^"d^>^,i$-^')6[x-j-my-B-mx'-J-my'-B-^'] )^)^)^ 



2 Q2 



+ {{{ 5^^"5^'^"^2[J-r'S-n^^^^^^^^^ 



qpp;t 



= V^g^ I Jirn^ (0{N-') + 0{N--2) + ^l^"^))^.^ + / dx"dy"g^[x", y"] C[x, y; x' , y' ■ x\y"] 

= rf j dx"dy"g'[x", y"] g^C[x, y; x' , y'; a;", /] 
Similarly:



^P[.,y],Pl.',y']l ■ -J - ^i™^ 

{{g[j-^',B-m-m^-j-my-B-$%)f, {{g[j-^,B-m<'w-j-my'-B-a)D)D 



xlg[x',y']{{6^^,6[x'-jmy'-B-$]hh + {{6^^>g[J-tB-$]^6[x'-J-m^ 



qpP;f 



.2 



J lim /[0{N-^) + 0{N-^)}[0{N-^) + 0{N-^)}\ =0 



dxdx' Af-»oo \ L J L J / c^-i 

For batch learning all diffusion matrix elements of (19) vanish in a straightforward manner. For on-line
learning all diffusion terms vanish provided we can prove that the function $\mathcal{C}[\ldots]$ of (24) is zero. This
is indeed the case within the present theory, as will be verified in the Appendix. The Fokker-Planck
equation (19) now reduces to the Liouville equation $\frac{d}{dt}P_t(\boldsymbol{\Omega}) = -\sum_\mu \frac{\partial}{\partial\Omega_\mu}\{F_\mu[\boldsymbol{\Omega};t]P_t(\boldsymbol{\Omega})\}$, describing
deterministic evolution for our macroscopic observables: $\frac{d}{dt}\boldsymbol{\Omega} = \boldsymbol{F}[\boldsymbol{\Omega};t]$. These deterministic equations
we will now work out explicitly.

On-Line Learning 

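First we deal with the scalar observables $Q$ and $R$:

$$\begin{aligned}
\frac{d}{dt}Q &= \lim_{N\to\infty}\Big\{2\eta\Big\langle\big\langle(\boldsymbol{J}\cdot\boldsymbol{\xi})\,\mathcal{G}[\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]\big\rangle_{\tilde{D}}\Big\rangle_{QRP;t} + \eta^2\Big\langle\big\langle\mathcal{G}^2[\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]\big\rangle_{\tilde{D}}\Big\rangle_{QRP;t}\Big\} \\
&= 2\eta\int\! dx\,dy\; P[x,y]\; x\,\mathcal{G}[x,y] + \eta^2\int\! dx\,dy\; P[x,y]\;\mathcal{G}^2[x,y]
\end{aligned}$$

$$\frac{d}{dt}R = \lim_{N\to\infty}\eta\Big\langle\big\langle(\boldsymbol{B}\cdot\boldsymbol{\xi})\,\mathcal{G}[\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]\big\rangle_{\tilde{D}}\Big\rangle_{QRP;t} = \eta\int\! dx\,dy\; P[x,y]\; y\,\mathcal{G}[x,y]$$

These equations are identical to those found in the $\alpha \to \infty$ formalism. The difference is in the function
to be substituted for $P[x,y]$, which here is the solution of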



^\h^\x-J-i-^g{J-i,B-i\\b{y-B-i\-b{x-J-my-B-i\)f^ 

a \ 

-ri^ [g[x,yMx-J.my-B-$])f,] - ^r?2^ [g^[x,y]{6[x-J-my-B-^] 



Id / 

/ QFfP;t 
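(where we have inserted (21))

$$\begin{aligned}
= \; & \frac{1}{\alpha}\Big\{\int\! dx'\; P[x',y]\,\delta\big[x-x'-\eta\,\mathcal{G}[x',y]\big] - P[x,y]\Big\} - \eta\frac{\partial}{\partial x}\int\! dx'dy'\; \mathcal{A}[x,y;x',y']\,\mathcal{G}[x',y'] \\
& + \frac{1}{2}\eta^2\int\! dx'dy'\; P[x',y']\,\mathcal{G}^2[x',y']\;\frac{\partial^2}{\partial x^2}P[x,y] + \frac{1}{2}\eta^2\frac{\partial^2}{\partial x^2}\int\! dx'dy'\; \mathcal{B}[x,y;x',y']\,\mathcal{G}^2[x',y']
\end{aligned}$$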



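Anticipating the term $\mathcal{B}[\ldots]$ to be zero (as shown in the Appendix) we thus arrive at the following set
of coupled deterministic macroscopic equations:

$$\frac{d}{dt}Q = 2\eta\int\! dx\,dy\; P[x,y]\; x\,\mathcal{G}[x,y] + \eta^2\int\! dx\,dy\; P[x,y]\;\mathcal{G}^2[x,y] \tag{26}$$

$$\frac{d}{dt}R = \eta\int\! dx\,dy\; P[x,y]\; y\,\mathcal{G}[x,y] \tag{27}$$

$$\begin{aligned}
\frac{d}{dt}P[x,y] = \; & \frac{1}{\alpha}\Big\{\int\! dx'\; P[x',y]\,\delta\big[x-x'-\eta\,\mathcal{G}[x',y]\big] - P[x,y]\Big\} - \eta\frac{\partial}{\partial x}\int\! dx'dy'\; \mathcal{A}[x,y;x',y']\,\mathcal{G}[x',y'] \\
& + \frac{1}{2}\eta^2\int\! dx'dy'\; P[x',y']\,\mathcal{G}^2[x',y']\;\frac{\partial^2}{\partial x^2}P[x,y]
\end{aligned} \tag{28}$$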



Batch Learning 

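For $Q$ and $R$ one again finds simple equations:

$$\frac{d}{dt}Q = \lim_{N\to\infty} 2\eta\Big\langle\big\langle(\boldsymbol{J}\cdot\boldsymbol{\xi})\,\mathcal{G}[\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]\big\rangle_{\tilde{D}}\Big\rangle_{QRP;t} = 2\eta\int\! dx\,dy\; P[x,y]\; x\,\mathcal{G}[x,y]$$

$$\frac{d}{dt}R = \lim_{N\to\infty}\eta\Big\langle\big\langle(\boldsymbol{B}\cdot\boldsymbol{\xi})\,\mathcal{G}[\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]\big\rangle_{\tilde{D}}\Big\rangle_{QRP;t} = \eta\int\! dx\,dy\; P[x,y]\; y\,\mathcal{G}[x,y]$$

Finally we calculate the temporal derivative of the joint field distribution:

$$\begin{aligned}
\frac{d}{dt}P[x,y] = \lim_{N\to\infty}\Big\{ & -\eta\frac{\partial}{\partial x}\Big\langle\Big\langle\Big\langle\mathcal{G}[\boldsymbol{J}\cdot\boldsymbol{\xi}',\boldsymbol{B}\cdot\boldsymbol{\xi}']\,(\boldsymbol{\xi}\cdot\boldsymbol{\xi}')\,\delta[x-\boldsymbol{J}\cdot\boldsymbol{\xi}]\,\delta[y-\boldsymbol{B}\cdot\boldsymbol{\xi}]\Big\rangle_{\tilde{D}}\Big\rangle_{\tilde{D}}\Big\rangle_{QRP;t} \\
& + \frac{\eta^2}{2N}\frac{\partial^2}{\partial x^2}\Big\langle\Big\langle\Big\langle\Big\langle\mathcal{G}[\boldsymbol{J}\cdot\boldsymbol{\xi}',\boldsymbol{B}\cdot\boldsymbol{\xi}']\,\mathcal{G}[\boldsymbol{J}\cdot\boldsymbol{\xi}'',\boldsymbol{B}\cdot\boldsymbol{\xi}'']\,(\boldsymbol{\xi}\cdot\boldsymbol{\xi}')(\boldsymbol{\xi}\cdot\boldsymbol{\xi}'')\,\delta[x-\boldsymbol{J}\cdot\boldsymbol{\xi}]\,\delta[y-\boldsymbol{B}\cdot\boldsymbol{\xi}]\Big\rangle_{\tilde{D}}\Big\rangle_{\tilde{D}}\Big\rangle_{\tilde{D}}\Big\rangle_{QRP;t}\Big\}
\end{aligned}$$

$$\begin{aligned}
= \; & -\frac{\eta}{\alpha}\frac{\partial}{\partial x}\Big[\mathcal{G}[x,y]\,P[x,y]\Big] - \eta\frac{\partial}{\partial x}\int\! dx'dy'\; \mathcal{A}[x,y;x',y']\,\mathcal{G}[x',y'] \\
& + \frac{1}{2}\eta^2\frac{\partial^2}{\partial x^2}\int\! dx'dy'\,dx''dy''\; \mathcal{C}[x,y;x',y';x'',y'']\,\mathcal{G}[x',y']\,\mathcal{G}[x'',y'']
\end{aligned}$$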
Anticipating the term $\mathcal{C}[\ldots]$ to be zero (to be demonstrated in the Appendix) we thus arrive at the
following coupled deterministic macroscopic equations:

$$\frac{d}{dt}Q = 2\eta\int\! dx\,dy\; P[x,y]\; x\,\mathcal{G}[x,y] \tag{29}$$

$$\frac{d}{dt}R = \eta\int\! dx\,dy\; P[x,y]\; y\,\mathcal{G}[x,y] \tag{30}$$

$$\frac{d}{dt}P[x,y] = -\frac{\eta}{\alpha}\frac{\partial}{\partial x}\Big[\mathcal{G}[x,y]\,P[x,y]\Big] - \eta\frac{\partial}{\partial x}\int\! dx'dy'\; \mathcal{A}[x,y;x',y']\,\mathcal{G}[x',y'] \tag{31}$$

The difference between the macroscopic equations for batch and on-line learning is merely the presence
(on-line) or absence (batch) of those terms which are not linear in the learning rate $\eta$ (i.e. of order $\eta^2$
or higher).
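The right-hand sides of the scalar equations (26,27) and (29,30) are plain averages over the joint field distribution, so they can be estimated from a sample of field pairs $(x,y)$, for instance the fields measured on the training set in a simulation. The sketch below is a minimal illustration with hypothetical function names; the Hebbian form $\mathcal{G}[x,y] = \mathrm{sgn}(y)$ is an assumed example of a learning rule.

```python
def hebb(x, y):
    # Assumed Hebbian rule: G[x, y] = sgn(y).
    return 1.0 if y >= 0 else -1.0


def flow_QR(samples, G, eta, online=True):
    """Estimate (dQ/dt, dR/dt) from a list of field pairs (x, y) drawn from P[x, y].

    On-line: dQ/dt = 2 eta <x G> + eta^2 <G^2>, dR/dt = eta <y G>.
    Batch:   the term of order eta^2 is absent.
    """
    n = len(samples)
    xG = sum(x * G(x, y) for x, y in samples) / n
    yG = sum(y * G(x, y) for x, y in samples) / n
    G2 = sum(G(x, y) ** 2 for x, y in samples) / n
    dQ = 2.0 * eta * xG + (eta ** 2 * G2 if online else 0.0)
    dR = eta * yG
    return dQ, dR
```

The single `online` flag reflects the statement above: the two sets of equations differ only by the terms that are not linear in $\eta$.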

2.5 Closure of Macroscopic Dynamical Laws 

The complexity of the problem is fully concentrated in the Green's function $\mathcal{A}[x,y;x',y']$ defined in
(22). Our macroscopic laws are exact for $N \to \infty$ but not yet closed, due to the appearance of the
microscopic probability density $p_t(\boldsymbol{J})$ in the sub-shell average of (22). We now close our macroscopic
laws by making, for $N \to \infty$, the two key assumptions underlying dynamical replica theories:

1. Our macroscopic observables $\{Q,R,P\}$ obey closed dynamic equations.

2. These macroscopic equations are self-averaging with respect to the disorder, i.e. the microscopic
realisation of the training set $\tilde{D}$.

Assumption 1 implies that all microscopic probability variations within the $\{Q,R,P\}$ sub-shells of
the $\boldsymbol{J}$-ensemble are either absent or irrelevant to the evolution of $\{Q,R,P\}$. We may consequently
make the simplest self-consistent choice for $p_t(\boldsymbol{J})$ in evaluating the macroscopic laws, i.e. in (22):
microscopic probability equipartitioning in the $\{Q,R,P\}$-subshells of the ensemble, or

$$p_t(\boldsymbol{J}) \to w(\boldsymbol{J}) \sim \delta\big[Q-Q[\boldsymbol{J}]\big]\;\delta\big[R-R[\boldsymbol{J}]\big]\;\prod_{xy}\delta\big[P[x,y]-P[x,y;\boldsymbol{J}]\big] \tag{32}$$

This new microscopic distribution $w(\boldsymbol{J})$ depends on time via the order parameters $\{Q,R,P\}$. Note
that (32) leads to exact macroscopic laws if our observables $\{Q,R,P\}$ for $N \to \infty$ indeed obey closed
equations, and is true in equilibrium for detailed balance models in which the Hamiltonian can be
written in terms of $\{Q,R,P\}$. It is an approximation if our observables do not obey closed equations.
Assumption 2 allows us to average the macroscopic laws over the disorder; for mean-field models it is 
usually convincingly supported by numerical simulations, and can be proven using the path integral 
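formalism (see e.g. |jl^]). We write averages over all training sets $\tilde{D} \subseteq \{-1,1\}^N$, with $|\tilde{D}| = p$, as
$\langle\cdots\rangle_\Xi$. Our assumptions result in the closure of the two sets (26,27,28) and (29,30,31), since now the
function $\mathcal{A}[x,y;x',y']$ is expressed fully in terms of $\{Q,R,P\}$:

$$\mathcal{A}[x,y;x',y'] = \lim_{N\to\infty}\Big\langle\frac{\int\! d\boldsymbol{J}\; w(\boldsymbol{J})\;\Big\langle\Big\langle\delta[x-\boldsymbol{J}\cdot\boldsymbol{\xi}]\,\delta[y-\boldsymbol{B}\cdot\boldsymbol{\xi}]\,(\boldsymbol{\xi}\cdot\boldsymbol{\xi}')\,\bar{\delta}_{\boldsymbol{\xi}\boldsymbol{\xi}'}\,\delta[x'-\boldsymbol{J}\cdot\boldsymbol{\xi}']\,\delta[y'-\boldsymbol{B}\cdot\boldsymbol{\xi}']\Big\rangle_{\tilde{D}}\Big\rangle_{\tilde{D}}}{\int\! d\boldsymbol{J}\; w(\boldsymbol{J})}\Big\rangle_\Xi$$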
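The final ingredient of dynamical replica theory is the realization that averages of fractions can be
calculated with the replica identity

$$\Big\langle\frac{\int\! d\boldsymbol{J}\; W[\boldsymbol{J},\boldsymbol{z}]\,G[\boldsymbol{J},\boldsymbol{z}]}{\int\! d\boldsymbol{J}\; W[\boldsymbol{J},\boldsymbol{z}]}\Big\rangle_{\boldsymbol{z}} = \lim_{n\to 0}\int\! d\boldsymbol{J}^1\cdots d\boldsymbol{J}^n\;\Big\langle G[\boldsymbol{J}^1,\boldsymbol{z}]\prod_{\alpha=1}^{n}W[\boldsymbol{J}^\alpha,\boldsymbol{z}]\Big\rangle_{\boldsymbol{z}}$$

Since each weight component scales as $J_i^\alpha = O(N^{-1/2})$ we transform variables in such a way that our
calculations will involve $O(1)$ objects:

$$(\forall i)(\forall\alpha): \qquad J_i^\alpha = (Q/N)^{1/2}\,\sigma_i^\alpha, \qquad B_i = N^{-1/2}\,\tau_i$$

This ensures $\sigma_i^\alpha = O(1)$, $\tau_i = O(1)$, and reduces various constraints to ordinary spherical ones:
$[\boldsymbol{\sigma}^\alpha]^2 = \boldsymbol{\tau}^2 = N$ for all $\alpha$. Overall prefactors generated by these transformations always vanish due
to $n \to 0$. We find a new effective measure: $\prod_{\alpha=1}^{n} w(\boldsymbol{J}^\alpha)\,d\boldsymbol{J}^\alpha \to \prod_{\alpha=1}^{n} w(\boldsymbol{\sigma}^\alpha)\,d\boldsymbol{\sigma}^\alpha$, with

$$w(\boldsymbol{\sigma}) \sim \delta\big[N-\boldsymbol{\sigma}^2\big]\;\delta\big[NRQ^{-1/2}-\boldsymbol{\tau}\cdot\boldsymbol{\sigma}\big]\;\prod_{xy}\delta\big[P[x,y]-P[x,y;(Q/N)^{1/2}\boldsymbol{\sigma}]\big] \tag{33}$$

We thus arrive at

$$\begin{aligned}
\mathcal{A}[x,y;x',y'] = \lim_{N\to\infty}\lim_{n\to 0}\int\!\prod_{\alpha=1}^{n}w(\boldsymbol{\sigma}^\alpha)\,d\boldsymbol{\sigma}^\alpha\;\Big\langle\Big\langle\Big\langle & \delta\Big[x-\frac{\sqrt{Q}\,\boldsymbol{\sigma}^1\cdot\boldsymbol{\xi}}{\sqrt{N}}\Big]\,\delta\Big[y-\frac{\boldsymbol{\tau}\cdot\boldsymbol{\xi}}{\sqrt{N}}\Big]\,(\boldsymbol{\xi}\cdot\boldsymbol{\xi}')\,\bar{\delta}_{\boldsymbol{\xi}\boldsymbol{\xi}'} \\
& \times\;\delta\Big[x'-\frac{\sqrt{Q}\,\boldsymbol{\sigma}^1\cdot\boldsymbol{\xi}'}{\sqrt{N}}\Big]\,\delta\Big[y'-\frac{\boldsymbol{\tau}\cdot\boldsymbol{\xi}'}{\sqrt{N}}\Big]\Big\rangle_{\tilde{D}}\Big\rangle_{\tilde{D}}\Big\rangle_\Xi
\end{aligned} \tag{34}$$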



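In the same fashion one can also express $P[x,y]$ in replica form (which will prove useful for normalization
purposes and for self-consistency tests):

$$P[x,y] = \lim_{N\to\infty}\lim_{n\to 0}\int\!\prod_{\alpha=1}^{n}w(\boldsymbol{\sigma}^\alpha)\,d\boldsymbol{\sigma}^\alpha\;\Big\langle\Big\langle\delta\Big[x-\frac{\sqrt{Q}\,\boldsymbol{\sigma}^1\cdot\boldsymbol{\xi}}{\sqrt{N}}\Big]\,\delta\Big[y-\frac{\boldsymbol{\tau}\cdot\boldsymbol{\xi}}{\sqrt{N}}\Big]\Big\rangle_{\tilde{D}}\Big\rangle_\Xi \tag{35}$$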



Finally we will have to demonstrate that the two functions . .] and C[. . .], as defined in (23 J2^), do 
indeed vanish self-consistently, as claimed. To achieve this we again express them in replica form: 



/ri 
n w{cT'')da^ 
™ a=l 



N 



N 



and 



C[x,y;x',y';x",y"] = lim j Y[ w{cr°')dcr^ 

N^oo 



a=l 



N 



6 ^^,,5^,^ N ^ 



ID D a 



(36) 
At this stage the physics is over; what remains is to perform the summations and integrations in
(34, 35, 36, 37) in the limit $N \to \infty$. Full details of this exercise are given in Appendix A, where
we show that (36) and (37) are indeed zero, and where we derive, in replica symmetric ansatz, an
expression for the Green's function (34). It turns out that to calculate this Green's function $A[\ldots]$
one has to solve two coupled saddle-point equations at each time-step, one scalar equation relating
to a spin-glass order parameter $q$, and one functional saddle-point equation relating to an effective
single-spin measure.



3 Summary of the Theory and Connection with $\alpha \to \infty$ Formalism

In this section we summarize the results obtained so far (including the replica calculation in Appendix
A), and we show that our general theory has the satisfactory property that it incorporates the standard
formalism developed for infinite training sets (with Gaussian joint field distributions $P[x,y]$ at any
time) as a special case, recovered in the limit $\alpha \to \infty$. In addition we provide a proof of the uniqueness
of the solution of the RS functional saddle-point equation and show that it can be found as the fixed-point of an
iterative map.

3.1 Summary of the Theory 

Our theory can be summarized in the following compact way: 



Dynamic Equations for Observables 

Our observables are $Q = J^2$, $R = J\cdot B$, and the joint distribution of student and teacher fields $P[x,y] = \langle \delta[x - J\cdot\xi]\, \delta[y - B\cdot\xi] \rangle_{\tilde{D}}$. For $N \to \infty$ these quantities obey closed, deterministic, and self-averaging
macroscopic dynamic equations. One always has $P[x,y] = P[x|y]P[y]$ with $P[y] = (2\pi)^{-1/2} e^{-y^2/2}$. We
define $\langle f[x,y] \rangle = \int dx\, Dy\; P[x|y]\, f[x,y]$, with the familiar short-hand $Dy = (2\pi)^{-1/2} e^{-y^2/2}\, dy$, and the
following four averages (the function $\Phi[x,y]$ will be given below):



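$$U = \langle \Phi[x,y]\, \mathcal{G}[x,y] \rangle, \qquad V = \langle x\, \mathcal{G}[x,y] \rangle, \qquad W = \langle y\, \mathcal{G}[x,y] \rangle, \qquad Z = \langle \mathcal{G}^2[x,y] \rangle \qquad (38)$$

For on-line learning our macroscopic laws are

$$\frac{d}{dt} Q = 2\eta V + \eta^2 Z, \qquad \frac{d}{dt} R = \eta W \qquad (39)$$

$$\frac{d}{dt} P[x|y] = \frac{1}{\alpha} \int dx'\, P[x'|y] \Big\{ \delta\big[x - x' - \eta\, \mathcal{G}[x',y]\big] - \delta[x - x'] \Big\} - \eta \frac{\partial}{\partial x} \Big\{ P[x|y] \big[ U(x - Ry) + Wy \big] \Big\}$$
$$\qquad\qquad + \frac{1}{2} \eta^2 Z \frac{\partial^2}{\partial x^2} P[x|y] - \eta \big[ V - RW - (Q - R^2) U \big] \frac{\partial}{\partial x} \Big\{ P[x|y]\, \Phi[x,y] \Big\} \qquad (40)$$

For batch learning one has:

$$\frac{d}{dt} Q = 2\eta V, \qquad \frac{d}{dt} R = \eta W \qquad (41)$$

$$\frac{d}{dt} P[x|y] = -\frac{\eta}{\alpha} \frac{\partial}{\partial x} \Big\{ P[x|y]\, \mathcal{G}[x,y] \Big\} - \eta \frac{\partial}{\partial x} \Big\{ P[x|y] \big[ U(x - Ry) + Wy \big] \Big\} - \eta \big[ V - RW - (Q - R^2) U \big] \frac{\partial}{\partial x} \Big\{ P[x|y]\, \Phi[x,y] \Big\} \qquad (42)$$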



Note that the batch equations follow from the on-line ones by retaining only terms which are linear
in the learning rate. From the solution of the above equations follow, in turn, the training and
generalization errors:

$$E_t = \langle \theta[-xy] \rangle, \qquad E_g = \frac{1}{\pi} \arccos\big[ R/\sqrt{Q} \big] \qquad (43)$$
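The two error measures in (43) are simple to evaluate once Q, R and P[x|y] are known. The sketch below is a minimal illustration, not part of the original text: the generalization error is the arccos formula itself, while the training error is evaluated only for the special case of a Gaussian conditional distribution with mean Ry and variance Q - R² (the form the paper associates with infinite training sets). In that special case the two errors should coincide, which gives a convenient numerical check. Function names and the integration grid are assumptions made for the example.

```python
import math


def generalization_error(Q, R):
    """E_g = arccos(R / sqrt(Q)) / pi, as in (43)."""
    if Q <= 0.0:
        raise ValueError("Q must be positive")
    # Clamp against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, R / math.sqrt(Q)))
    return math.acos(cos_angle) / math.pi


def training_error_gaussian(Q, R, y_max=8.0, steps=16000):
    """E_t = <theta[-x y]> for the assumed Gaussian P[x|y] with mean R*y
    and variance Q - R^2, using the trapezoidal rule over y."""
    var = Q - R * R
    if var <= 0.0:
        raise ValueError("need Q > R^2")
    h = 2.0 * y_max / steps
    total = 0.0
    for i in range(steps + 1):
        y = -y_max + i * h
        # Probability that x has the opposite sign to y, given y.
        p_wrong = 0.5 * math.erfc(R * abs(y) / math.sqrt(2.0 * var))
        edge = 0.5 if i in (0, steps) else 1.0
        total += edge * p_wrong * math.exp(-0.5 * y * y)
    return total * h / math.sqrt(2.0 * math.pi)


print(generalization_error(1.0, 0.5))
print(training_error_gaussian(1.0, 0.5))
```

For finite α the conditional distribution is not of this Gaussian form, and E_t then has to be computed from the actual solution P[x|y] of the diffusion equation.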

Saddle-Point Equations and the function $\Phi$

The function $\Phi[x,y]$ appearing in the above equations (generated by the Green's function $A[\ldots]$)
is expressed in terms of auxiliary order parameters. These come about in the replica calculation
of Appendix A, where the order parameters are defined through Dirac $\delta$ functions in their integral
representation (similar to equation (13)). The first auxiliary order parameter is a spin-glass type order
parameter $q = \langle \langle J \rangle^2 \rangle_\Xi / Q$, with $R^2/Q \leq q \leq 1$. The second, defined similarly for the joint probability
$P[x,y]$, is the function $\chi[x,y]$ (for details see the Appendix). The latter is not necessarily normalised
and in what follows it is useful to consider the effective measure $M[x,y]$ which is related to $\chi[x,y]$
through a simple transformation (equation 111). The measure $M[x,y]$ is non-negative and can
always be normalized such that $\int dx\, M[x,y] = 1$ for all $y \in \Re$, as emphasized in our notation by writing
$M[x,y] \to M[x|y]$. The auxiliary order parameters are calculated at each time-step by solving the
following two coupled saddle-point equations:



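$$\langle (x - Ry)^2 \rangle + (qQ - R^2)\Big(1 - \frac{1}{\alpha}\Big) = \frac{1 + q - 2R^2/Q}{1 - q} \int Dy\, Dz\, \big[ \langle x^2 \rangle_\star - \langle x \rangle_\star^2 \big] \qquad (44)$$

$$\text{for all } X, y: \qquad P[X|y] = \int Dz\, \langle \delta[X - x] \rangle_\star \qquad (45)$$

in which

$$\langle f[x,y,z] \rangle_\star = \frac{\int dx\, M[x|y]\, e^{Bxz} f[x,y,z]}{\int dx\, M[x|y]\, e^{Bxz}}, \qquad B = \frac{\sqrt{qQ - R^2}}{Q(1 - q)} \qquad (46)$$

After $q$ and $M[x|y]$ have been determined, the key function $\Phi$ in (38, 40, 42) is calculated as

$$\Phi[X,y] = \big\{ Q(1 - q)\, P[X|y] \big\}^{-1} \int Dz\, \langle X - x \rangle_\star\, \langle \delta[X - x] \rangle_\star \qquad (47)$$

or, equivalently:

$$\Phi[X,y] = \frac{1}{\sqrt{qQ - R^2}\; P[X|y]} \int Dz\; z\, \langle \delta[X - x] \rangle_\star \qquad (48)$$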



Finding a saddle-point problem for an order parameter function, rather than a finite number of scalar
order parameters, introduces the possibility of a proliferation of saddle-points. In the next section we
will show that this does not happen: the solution of the functional saddle-point problem is unique,
and can even be found iteratively by executing a specific non-linear mapping.



3.2 Uniqueness and Iterative Calculation of the Functional Saddle-Point 

The uniqueness proof is more easily set up in terms of the original order parameter function $\chi[x,y]$,
rather than the new (normalised) measure $M[x|y]$ (see the Appendix). For a given state $\{Q,R,P\}$
and a given value for $q \in [R^2/Q, 1]$ we have to find the functional saddle-points of the functional $\Psi[\chi]$,
defined as:

$$\Psi[\chi] = \alpha \int Dy\, Dz\, \log \int dx\; e^{Bxz + \alpha^{-1}\chi[x,y]} - \int Dy\, dx\; P[x|y]\, \chi[x,y] \qquad (49)$$

Our proof will carry the existence of the various integrals as an implicit condition for validity. To
reduce notational ballast we define

$$w(x,y,z) = \frac{e^{Bxz + \alpha^{-1}\chi[x,y]}}{\int dx'\; e^{Bx'z + \alpha^{-1}\chi[x',y]}}, \qquad \langle f[x,y,z] \rangle_\star = \int dx\; w(x,y,z)\, f[x,y,z]$$

Note: $w(x,y,z) = M[x|y]\, e^{Bxz} / \int dx'\, M[x'|y]\, e^{Bx'z}$. The function $w(x,y,z)$ obeys

$$\frac{\delta w(u,v,z)}{\delta \chi[u',v']} = \alpha^{-1} \delta[v - v'] \big[ \delta[u - u']\, w(u,v,z) - w(u,v,z)\, w(u',v,z) \big]$$

The functional saddle-point equation is obtained by requiring the first functional derivative of $\Psi[\chi]$
with respect to $\chi[u,v]$ to be zero for all $u, v \in \Re$, where

$$\frac{\delta \Psi}{\delta \chi[u,v]} = \frac{e^{-\frac{1}{2}v^2}}{\sqrt{2\pi}} \Big\{ \int Dz\; w(u,v,z) - P[u|v] \Big\} \qquad (50)$$

Clearly, if the function $\chi[x,y]$ is a saddle-point, then so is the function $\chi[x,y] + \rho(y)$ for any $\rho(y)$.
This degree of freedom is irrelevant because such terms $\rho(y)$ will drop out of the measure $\langle \ldots \rangle_\star$.
Furthermore, one immediately verifies that transformations of the form $\chi[x,y] \to \chi[x,y] + \rho(y)$ leave the
functional $\Psi[\ldots]$ (49) invariant. Next we calculate the Hessian (or curvature) operator $H[u,v;u',v';\chi]$,
using (50):



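$$H[u,v;u',v';\chi] = \frac{\delta^2 \Psi}{\delta \chi[u,v]\, \delta \chi[u',v']} = \frac{e^{-\frac{1}{2}v^2}}{\sqrt{2\pi}} \int Dz\, \frac{\delta w(u,v,z)}{\delta \chi[u',v']}$$
$$\qquad = \frac{e^{-\frac{1}{2}v^2}}{\alpha \sqrt{2\pi}}\; \delta[v - v'] \int Dz\, \big[ \delta[u - u']\, w(u,v,z) - w(u,v,z)\, w(u',v,z) \big] \qquad (51)$$

$H[u,v;u',v';\chi]$ is non-negative definite for each $\chi$, and thus the functional $\Psi$ is convex, since for any
function $\phi[u,v]$ for which the relevant integrals exist we find

$$\int du\, dv\, du'\, dv'\; \phi[u,v]\, H[u,v;u',v';\chi]\, \phi[u',v'] = \frac{1}{\alpha} \int Dv\, Dz\, \Big\{ \langle \phi^2[u,v] \rangle_\star - \langle \phi[u,v] \rangle_\star^2 \Big\} \geq 0$$

The kernel of $H[u,v;u',v';\chi]$, for a given 'point' $\chi$ in $\chi$-space, is determined by requiring equality in
the above inequality, i.e.

$$\text{for each } v, z: \quad \Big\langle \big[ \phi[u,v] - \langle \phi[u,v] \rangle_\star \big]^2 \Big\rangle_\star = 0 \qquad \text{so} \qquad \frac{\partial}{\partial u} \phi[u,v] = 0$$

For each $\chi$ the kernel of the second functional derivative $H[x,y;x',y';\chi]$ thus consists of the set of all
(integrable) functions $\phi[x,y]$ which depend on $y$ only.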

We now find that, if $\chi_0[x,y]$ and $\chi_1[x,y]$ are both functional saddle-points of $\Psi[\chi]$, then $\chi_1[x,y] - \chi_0[x,y] = \rho(y)$ for some function $\rho(y)$. In other words: apart from the aforementioned irrelevant degree
of freedom, the solution of the functional saddle-point equation (45) is unique. To show this, consider
two functions $\chi_0[x,y]$ and $\chi_1[x,y]$ which are both functional saddle-points of $\Psi$, i.e. corresponding to
solutions of (45). Define a path $\{\chi_t\}$ through $\chi$-space, connecting these two functions:

$$\chi_t[x,y] = \chi_0[x,y] + t \big\{ \chi_1[x,y] - \chi_0[x,y] \big\}, \qquad t \in [0,1]$$

Integration along this path will bring us from $\chi_0$ to $\chi_1$. Thus for any functional $L[\chi]$ one has

$$L[\chi_1] - L[\chi_0] = \int_{\chi_0}^{\chi_1} dL[\chi] = \int du\, dv \int_{\chi_0}^{\chi_1} d\chi[u,v]\, \frac{\delta L}{\delta \chi[u,v]} = \int du\, dv\, \big[ \chi_1[u,v] - \chi_0[u,v] \big] \int_0^1 dt\, \left. \frac{\delta L}{\delta \chi[u,v]} \right|_{\chi_t}$$

For the functional $L[\chi]$ we now choose a functional first derivative of $\Psi[\chi]$, i.e. $L[\chi] = \delta \Psi / \delta \chi[x,y]$ for
some $x, y \in \Re$. Since both $\chi_0$ and $\chi_1$ are saddle-points one finds $L[\chi_0] = L[\chi_1] = 0$. Thus

$$\int du\, dv\, \big[ \chi_1[u,v] - \chi_0[u,v] \big] \int_0^1 dt\, \left. \frac{\delta^2 \Psi}{\delta \chi[u,v]\, \delta \chi[x,y]} \right|_{\chi_t} = 0$$

Multiply both sides by $\chi_1[x,y] - \chi_0[x,y]$ and integrate the result over $x, y$:

$$\int_0^1 dt \int du\, dv\, dx\, dy\; \big[ \chi_1[u,v] - \chi_0[u,v] \big]\, H[u,v;x,y;\chi_t]\, \big[ \chi_1[x,y] - \chi_0[x,y] \big] = 0$$

One concludes (since the Hessian is a symmetric non-negative operator):

$$\text{for all } t \in [0,1],\ u, v \in \Re: \qquad \int dx\, dy\; H[u,v;x,y;\chi_t]\, \big[ \chi_1[x,y] - \chi_0[x,y] \big] = 0$$

The function $\chi_1[x,y] - \chi_0[x,y]$ is in the kernel of $H[\chi_t]$ for any $t \in [0,1]$. The kernel of $H$ was already
determined to be the set of all integrable functions which depend on $y$ only, whatever the point $\chi$
where one chooses to evaluate $H$. Hence $\chi_1[x,y] - \chi_0[x,y] = \rho(y)$ for some function $\rho(y)$. Finally, the
remaining freedom in choosing a function $\rho$ is eliminated by our normalisation $\int dx\, M[x|y] = 1$ (for
each $y$), so that the solution $M[x|y]$ is indeed truly unique.

Next we will show how for any given value of the scalar order parameter $q$ and the observables
$\{Q,R,P\}$ (and thus of $B$), for which the relevant integrals exist, the unique solution $M[x|y]$ of the
functional saddle-point equation (45) can be constructed as the stable fixed-point of the following
functional map:

$$\text{for each } y \in \Re: \qquad M_{\ell+1}[x|y] = \frac{P[x|y]\, \Big\{ \int Dz\, \big[ \int dx'\, e^{Bz(x'-x)} M_\ell[x'|y] \big]^{-1} \Big\}^{-1}}{\int du\; P[u|y]\, \Big\{ \int Dz\, \big[ \int dx'\, e^{Bz(x'-u)} M_\ell[x'|y] \big]^{-1} \Big\}^{-1}} \qquad (52)$$

Clearly all fixed-points of this map correspond to normalised solutions $M[x|y]$ of the functional saddle-point
equation (45), of which there can be only one. Thus we only need to verify the convergence
of (52), which can be done most efficiently using an appropriate Lyapunov functional. Note that the
functional (49) can be written as

$$\Psi[M] = \alpha \int Dy\; \Omega[M|y] + \text{terms independent of } M[\ldots]$$

with

$$\Omega[M|y] = \int Dz\, \log \int dx\; M[x|y]\, e^{Bxz} - \int dx\; P[x|y] \log M[x|y] \qquad (53)$$



For any given $y \in \Re$ we will show (53) to be a Lyapunov functional for the mapping (52), i.e. $\Omega[M|y]$
is bounded from below and monotonically decreasing during the iteration of (52), with stationarity
obtained only when $M[\ldots]$ is the (unique) fixed-point of (52). First we prove that a lower bound for
$\Omega$ is given by the entropy of the conditional distribution $P[x|y]$:

$$\text{for any } M[\ldots] \text{ and any } y \in \Re: \qquad \Omega[M|y] \geq -\int dx\; P[x|y] \log P[x|y] \qquad (54)$$

The proof is elementary (using Jensen's inequality, and $\int Dz\; z = 0$):

$$\Omega[M|y] = \int Dz\, \log \Big\{ \int dx\; P[x|y]\, e^{Bxz + \log M[x|y] - \log P[x|y]} \Big\} - \int dx\; P[x|y] \log M[x|y]$$
$$\qquad \geq \int Dz \int dx\; P[x|y] \big\{ Bxz + \log M[x|y] - \log P[x|y] \big\} - \int dx\; P[x|y] \log M[x|y]$$
$$\qquad = -\int dx\; P[x|y] \log P[x|y]$$

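Secondly we show that (53) indeed decreases monotonically under (52) until the fixed-point of (52) is
reached. To do so we introduce the short-hand notations $\lambda_\ell(x,y,z) = Bxz + \log M_\ell[x|y] - \log P[x|y]$,
$\langle f[x] \rangle = \int dx\; P[x|y]\, f[x]$, and

$$v_\ell(x,y) = \Big\{ \int Dz\; e^{\lambda_\ell(x,y,z)}\, \big\langle e^{\lambda_\ell(x',y,z)} \big\rangle^{-1} \Big\}^{-1}$$

The iterative map can now be written as

$$M_{\ell+1}[x|y] = \frac{M_\ell[x|y]\, v_\ell(x,y)}{\int du\; M_\ell[u|y]\, v_\ell(u,y)}$$

This gives for the change in $\Omega[\ldots]$ during one iteration of the mapping, again with Jensen's inequality:

$$\Omega[M_{\ell+1}|y] - \Omega[M_\ell|y] = \int Dz\, \log \left\{ \frac{\int dx\; M_{\ell+1}[x|y]\, e^{Bxz}}{\int dx\; M_\ell[x|y]\, e^{Bxz}} \right\} - \int dx\; P[x|y] \log \left\{ \frac{M_{\ell+1}[x|y]}{M_\ell[x|y]} \right\}$$
$$\qquad = \int Dz\, \log \left\{ \frac{\big\langle e^{\lambda_\ell(x,y,z)}\, v_\ell(x,y) \big\rangle}{\big\langle e^{\lambda_\ell(x,y,z)} \big\rangle} \right\} - \langle \log v_\ell(x,y) \rangle$$
$$\qquad \leq \log \Big\langle v_\ell(x,y) \int Dz\; e^{\lambda_\ell(x,y,z)} \big\langle e^{\lambda_\ell(x',y,z)} \big\rangle^{-1} \Big\rangle - \langle \log v_\ell(x,y) \rangle$$
$$\qquad = -\langle \log v_\ell(x,y) \rangle = \Big\langle \log \int Dz\; e^{\lambda_\ell(x,y,z)} \big\langle e^{\lambda_\ell(x',y,z)} \big\rangle^{-1} \Big\rangle$$
$$\qquad \leq \log \int Dz\; \big\langle e^{\lambda_\ell(x,y,z)} \big\rangle \big\langle e^{\lambda_\ell(x',y,z)} \big\rangle^{-1} = 0$$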



Finally we round off our argument by inspecting the implications of having strict equality in the
above inequality. Equality can only occur if at both instances where Jensen's inequality was used in
replacements of the form $\langle \log(X) \rangle \leq \log \langle X \rangle$ the relevant stochastic variable $X$ was a constant. In our
problem this gives the two conditions

$$\frac{\partial}{\partial z} \left\{ \frac{\big\langle e^{\lambda_\ell(x,y,z)}\, v_\ell(x,y) \big\rangle}{\big\langle e^{\lambda_\ell(x,y,z)} \big\rangle} \right\} = 0, \qquad \frac{\partial}{\partial x} \int Dz\; e^{\lambda_\ell(x,y,z)} \big\langle e^{\lambda_\ell(x',y,z)} \big\rangle^{-1} = 0$$

If the second condition is met, the first immediately follows. Working out the second condition gives,
in combination with the property that $P[x|y]$ is normalised:

$$\int Dz\, \frac{M_\ell[x|y]\, e^{Bxz}}{\int dx'\; M_\ell[x'|y]\, e^{Bx'z}} = P[x|y]$$

Thus we have confirmed that $\Omega[M_{\ell+1}|y] = \Omega[M_\ell|y]$ if and only if $M_\ell[\ldots]$ is the (unique) fixed-point
of (52).

As a consequence of the above we may now write the normalised solution of our functional saddle-point
equation (45) in terms of repeated execution of the mapping (52), following an in principle
arbitrary initialisation:

$$\text{for all } y \in \Re: \qquad M[x|y] = \lim_{\ell \to \infty} M_\ell[x|y], \qquad M_0[x|y] = P[x|y]$$

This property simplifies the numerical solution of our equations drastically.
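The iteration of the map (52) and the behaviour of the Lyapunov functional (53) can be illustrated on a grid, for a single fixed y. The sketch below is an assumption-laden toy, not the authors' numerical scheme: integrals over x and z are replaced by uniform-grid sums, and the test distribution P[x|y] is taken to be Gaussian with mean 0.3 and variance 1.25, with B = 0.5. For that choice the fixed point is expected to be a Gaussian M[x|y] with the same mean and a variance σ² solving σ² + B²σ⁴ = 1.25, i.e. σ² = 1. All function names, grid sizes and parameter values are hypothetical.

```python
import math


def gaussian(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def z_weights(zs):
    """Weights approximating the Gaussian measure Dz on a uniform z-grid."""
    dz = zs[1] - zs[0]
    return [math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) * dz for z in zs]


def tilted_norms(M, xs, zs, B):
    """D(z) = int dx' M[x'|y] exp(B x' z) for every z on the grid."""
    dx = xs[1] - xs[0]
    return [sum(m * math.exp(B * x * z) for m, x in zip(M, xs)) * dx for z in zs]


def map_step(M, P, xs, zs, B):
    """One application of the map (52), discretised, for one fixed y."""
    dx = xs[1] - xs[0]
    D = tilted_norms(M, xs, zs, B)
    gz = z_weights(zs)
    # S(x) = int Dz [int dx' e^{Bz(x'-x)} M[x'|y]]^{-1} = int Dz e^{Bzx} / D(z)
    S = [sum(w * math.exp(B * x * z) / d for w, z, d in zip(gz, zs, D)) for x in xs]
    unnormalised = [p / s for p, s in zip(P, S)]
    norm = sum(unnormalised) * dx
    return [u / norm for u in unnormalised]


def omega(M, P, xs, zs, B):
    """Discretised Lyapunov functional (53)."""
    dx = xs[1] - xs[0]
    D = tilted_norms(M, xs, zs, B)
    first = sum(w * math.log(d) for w, d in zip(z_weights(zs), D))
    second = sum(p * math.log(m) for p, m in zip(P, M)) * dx
    return first - second


def solve_saddle_point(P, xs, zs, B, n_iter=25):
    """Iterate from M_0 = P and record the functional after every step."""
    M = list(P)
    history = [omega(M, P, xs, zs, B)]
    for _ in range(n_iter):
        M = map_step(M, P, xs, zs, B)
        history.append(omega(M, P, xs, zs, B))
    return M, history


def mean_and_variance(f, xs):
    dx = xs[1] - xs[0]
    mean = sum(x * v for x, v in zip(xs, f)) * dx
    var = sum((x - mean) ** 2 * v for x, v in zip(xs, f)) * dx
    return mean, var


# Hypothetical test case: Gaussian P[x|y] at one value of y.
xs = [-8.0 + 0.1 * i for i in range(161)]
zs = [-8.0 + 0.2 * i for i in range(81)]
B = 0.5
P = [gaussian(x, 0.3, 1.25) for x in xs]

M, history = solve_saddle_point(P, xs, zs, B)
print("mean, variance of M:", mean_and_variance(M, xs))
print("Omega before / after:", history[0], history[-1])
```

In a full calculation this loop would be run for every y on a grid and repeated at each time step, together with the scalar equation for q; the sketch only shows the inner fixed-point iteration.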

3.3 Fourier Representation and Conditionally-Gaussian Solutions 

There are two potential advantages of rewriting our equations in Fourier representation. Firstly,
after a Fourier transform the functional saddle-point equation (45) will acquire a much simpler form.
Secondly, in those cases where we expect $P[x|y]$ to be of a Gaussian shape in $x$ this would simplify
solution of the diffusion equations (40, 42). Clearly, $P[x,y]$ being Gaussian in $(x,y)$ is not equivalent
to $P[x|y]$ being Gaussian in $x$ only. The former requires



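$$\frac{\partial^2}{\partial y^2} \int dx\; x\, P[x|y] = \frac{\partial}{\partial y} \Big\{ \int dx\; x^2 P[x|y] - \Big[ \int dx\; x\, P[x|y] \Big]^2 \Big\} = 0,$$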



which will turn out to happen only for $\alpha \to \infty$. A Gaussian $P[x|y]$ with moments which depend on
$y$ in a non-trivial way, on the other hand, is found to occur also for $\alpha < \infty$, provided we consider
simple learning rules and small $\eta$ (see [17]). To avoid ambiguity we will call solutions of the latter
type 'conditionally-Gaussian'.

We introduce the Fourier transforms

$$\hat{P}[k|y] = \int dx\; e^{-ikx} P[x|y], \qquad \hat{M}[k|y] = \int dx\; e^{-ikx} M[x|y] \qquad (55)$$

The transformed functional saddle-point equation thereby acquires a very simple form

$$\hat{P}[k|y] = \int Dz\; \frac{\hat{M}[k + iBz|y]}{\hat{M}[iBz|y]} \qquad (56)$$

Note that, in contrast to the original equation (45), the transformed equation (56) need not have a
unique solution (it could allow for solutions corresponding to non-integrable functions in the original
problem). Consider, for instance, the transformation



M[k\y] ^ M[k\y] = 

with the property (verified by a simple transformation of variables): 

M[k+iBz\y] 



Dz 



M[iBz\y] 



ik/B+oo ^ M[k+iBz\y] 
ik/B-oo M[iBz\y] 



If $\hat{M}[k]$, which by definition cannot have poles, is sufficiently well behaved, a simple deformation of
the integration path (via contour integration) leads to the statement that if $\hat{M}[k|y]$ is a solution of
(56), then so is $\tilde{M}[k|y]$.

Transformation of the dynamical on-line equation (40) for $P[x|y]$ (from which the batch
equation (42) can be obtained by expansion in $\eta$) gives:



di 



log P[k\y] 



1 

a 



dk 



, P[k'\y] fdx' 
P[k\y] 



2tt 



J.x' (k' —k)—i'qkQ[x\y] 



d 1 

+ r]kU —log P[k\y\ — -rfk^ Z — irjk 



V-RW-{Q-R^)U 



Dz z 



irik{W-UR)y 
M[k + iBz\y] 



MliBz] 



(57) 



We now determine the conditions for equation (57) to have conditionally-Gaussian solutions. If
$P[x|y]$ is Gaussian in $x$ we can solve the functional saddle-point equation (45) (whose solution is
unique), and find the resulting pair of measures

$$P[x|y] = \frac{e^{-\frac{1}{2}[x - \bar{x}(y)]^2/\Delta^2(y)}}{\Delta(y)\sqrt{2\pi}}, \qquad M[x|y] = \frac{e^{-\frac{1}{2}[x - \bar{x}(y)]^2/\sigma^2(y)}}{\sigma(y)\sqrt{2\pi}} \qquad (58)$$

$$\Delta^2(y) = \sigma^2(y) + B^2 \sigma^4(y) \qquad (59)$$

with their Fourier transforms $\hat{P}[k|y] = \exp[-ik\bar{x}(y) - \frac{1}{2}k^2\Delta^2(y)]$ and $\hat{M}[k|y] = \exp[-ik\bar{x}(y) - \frac{1}{2}k^2\sigma^2(y)]$.
Insertion of these expressions as an Ansatz into (57), using the identity

$$\int Dz\; z\, \frac{\hat{M}[k + iBz|y]}{\hat{M}[iBz|y]} = -ikB\sigma^2(y)\, \hat{P}[k|y]$$

and performing some simple manipulations, gives the following simplified equation:

du 



-ik^x{y)--k'^^/\^{y) = - 
dt 2 dt a 



2tt 



-^[u-ikA{y)]^-ikvg[x{y)+u/liy),y] _ ^ l^^fcjp^y + U[x{y)-Ry]} 



--k^r]^Z + 2r]UA\y) +2r]a\y) 



V-RW-{Q-R^)U 



(60) 
From this it follows that conditionally-Gaussian solutions can occur in two situations only:



a 



oo 



or 



du 



e 2 



^[u-ikA{y)]^-ikriglx{y)+uA{y),y] 



(61) 



The first case corresponds to the familiar theory of infinite training sets (see next section). The second
case occurs for sufficiently simple learning rules $\mathcal{G}$ in combination either with batch execution
(so that of (61) we retain only the term linear in $\eta$) or with on-line execution for small $\eta$ (retaining in
(61) only $\eta$ and $\eta^2$ terms). The latter cases will be dealt with in more detail in [17].
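The relation between the Gaussian measures P[x|y] and M[x|y] in this section can be checked numerically. The sketch below assumes a Gaussian M[x|y] with mean x̄ and variance σ²; the tilted measure M e^{Bxz} is then Gaussian with mean x̄ + zBσ² and the same variance, so the right-hand side of the saddle-point equation P[X|y] = ∫Dz ⟨δ[X − x]⟩ can be evaluated by a one-dimensional sum over z and compared with a Gaussian of variance σ² + B²σ⁴. The function names, grid parameters and test values are assumptions made for this illustration.

```python
import math


def gaussian(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def sigma2_from_delta2(delta2, B):
    """Positive root sigma^2 of Delta^2 = sigma^2 + B^2 sigma^4."""
    if B == 0.0:
        return delta2
    return (math.sqrt(1.0 + 4.0 * B * B * delta2) - 1.0) / (2.0 * B * B)


def p_from_gaussian_m(X, xbar, sigma2, B, z_max=10.0, dz=0.05):
    """Numerical value of int Dz <delta[X - x]>_* for Gaussian M[x|y].

    For each z the tilted measure is Gaussian with mean xbar + z*B*sigma2
    and variance sigma2, so the average of delta[X - x] is that density at X.
    """
    n = int(round(2.0 * z_max / dz))
    total = 0.0
    for i in range(n + 1):
        z = -z_max + i * dz
        tilted_mean = xbar + z * B * sigma2
        total += math.exp(-0.5 * z * z) * gaussian(X, tilted_mean, sigma2)
    return total * dz / math.sqrt(2.0 * math.pi)


# Hypothetical values: B = 0.5, Delta^2 = 1.25, conditional mean 0.3.
B = 0.5
sigma2 = sigma2_from_delta2(1.25, B)
for X in (-1.0, 0.3, 2.0):
    print(X, p_from_gaussian_m(X, 0.3, sigma2, B), gaussian(X, 0.3, 1.25))
```

Inverting the variance relation in this way is what one would need in practice: the dynamics supplies the width of P[x|y], and the width of M[x|y] follows from it.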

3.4 Link with the Formalism for Complete Training Sets

The very least we should require of our theory is that it reduces to the simple $(Q,R)$ formalism of
complete training sets in the limit $\alpha \to \infty$. Here we will show that this indeed happens. In
the previous section we have seen that for $\alpha \to \infty$ our driven diffusion equation for the conditional
distribution $P[x|y]$ has conditionally-Gaussian solutions, with $\int dx\; x P[x|y] = \bar{x}(y)$ and $\int dx\; [x - \bar{x}(y)]^2 P[x|y] = \Delta^2(y)$. Note that for such solutions we can calculate objects such as $\langle x \rangle_\star$ and the
function (47) directly, giving

$$\langle x \rangle_\star = \bar{x}(y) + zB\sigma^2(y), \qquad \Phi[x,y] = \frac{x - \bar{x}(y)}{Q(1 - q)\,[1 + B^2\sigma^2(y)]}$$

with $\Delta^2(y) = \sigma^2(y) + B^2\sigma^4(y)$ and $B = \sqrt{qQ - R^2}/Q(1 - q)$. The remaining dynamical equations to be
solved are those for $Q$ and $R$, in combination with dynamical equations for the $y$-dependent cumulants
$\bar{x}(y)$ and $\Delta^2(y)$. These equations reduce to:

' 27]{xg[x,y]) +r]'^{G'^[x,y]) (on-line) d 

27]{xg[x,y]) 

1 d 



d_ 
di 



Q 



1 d 
27/ 



A^{y)-Q+R' 



x{y)-Ry 
■qdt \_ 

{{x-Ry')G[x ,y']) 



(batch) dt 

' \x{y)-Ry]{^x ,y']Q[x ,y]) 



R = v{yQ[x,y]) 



(62) 
(63) 



Q{1- 



1 



+ mx',y']g[x',y']) 



A'(y)- 



Q{l-qW{y) 
(64) 

with one remaining saddle-point equation to determine $q$, obtained upon working out (44) for conditionally-Gaussian
solutions:

$$\int Dy\, \big\{ [\bar{x}(y) - Ry]^2 + \Delta^2(y) \big\} + qQ - R^2 = \Big[ \frac{2(qQ - R^2)}{Q(1 - q)} + 1 \Big] \int Dy\; \sigma^2(y) \qquad (65)$$

We now make the Ansatz that $\bar{x}(y) = Ry$ and $\Delta^2(y) = Q - R^2$, i.e.

$$P[x|y] = \frac{e^{-\frac{1}{2}[x - Ry]^2/(Q - R^2)}}{\sqrt{2\pi(Q - R^2)}} \qquad (66)$$

Insertion into the dynamical equations shows that (63) is now immediately satisfied, that (64) reduces
to $\sigma^2(y) = Q(1 - q)$, and that as a result the saddle-point equation (65) is automatically satisfied.
Since (66) is parametrized by $Q$ and $R$ only, this leaves us with the closed equations

$$\frac{d}{dt} Q = \begin{cases} 2\eta \langle x\, \mathcal{G}[x,y] \rangle + \eta^2 \langle \mathcal{G}^2[x,y] \rangle & \text{(on-line)} \\ 2\eta \langle x\, \mathcal{G}[x,y] \rangle & \text{(batch)} \end{cases} \qquad\qquad \frac{d}{dt} R = \eta \langle y\, \mathcal{G}[x,y] \rangle \qquad (67)$$

These are the equations found in e.g. 1^. From our general theory for restricted training sets
we thus indeed recover in the limit $\alpha \to \infty$ the standard formalism (66, 67) describing learning with
complete training sets, as claimed.
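To make the closed (Q, R) equations of the infinite training set limit concrete, the sketch below integrates them for one specific learning rule. The rule is an assumption made purely for illustration: a Hebbian-type choice in which the update function equals sgn(y), which this section does not single out. With the Gaussian conditional distribution of mean Ry and variance Q − R², the three averages then become ⟨x sgn(y)⟩ = R√(2/π), ⟨y sgn(y)⟩ = √(2/π) and ⟨sgn²(y)⟩ = 1, so the flow can be integrated and compared with its elementary closed-form solution. All names and parameter values are hypothetical.

```python
import math

C = math.sqrt(2.0 / math.pi)  # <|y|> for a unit-variance Gaussian y


def flow(Q, R, eta, online=True):
    """Right-hand sides dQ/dt, dR/dt for the assumed rule G[x, y] = sgn(y)."""
    dQ = 2.0 * eta * R * C + (eta * eta if online else 0.0)
    dR = eta * C
    return dQ, dR


def integrate(Q0, R0, eta, t_end, dt=1e-3, online=True):
    """Midpoint (second-order Runge-Kutta) integration of the (Q, R) flow."""
    Q, R = Q0, R0
    for _ in range(int(round(t_end / dt))):
        dQ1, dR1 = flow(Q, R, eta, online)
        dQ2, dR2 = flow(Q + 0.5 * dt * dQ1, R + 0.5 * dt * dR1, eta, online)
        Q += dt * dQ2
        R += dt * dR2
    return Q, R


def exact(Q0, R0, eta, t, online=True):
    """Closed-form solution of the same flow, used as a cross-check."""
    R = R0 + eta * C * t
    Q = Q0 + 2.0 * eta * C * R0 * t + (eta * C * t) ** 2
    if online:
        Q += eta * eta * t
    return Q, R


def generalization_error(Q, R):
    return math.acos(max(-1.0, min(1.0, R / math.sqrt(Q)))) / math.pi


Q, R = integrate(1.0, 0.0, 0.5, 2.0)
print("numerical:", Q, R)
print("exact    :", exact(1.0, 0.0, 0.5, 2.0))
print("E_g      :", generalization_error(Q, R))
```

For other learning rules only the function `flow` changes; the averages then generally have to be computed as Gaussian integrals rather than in closed form.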



4 Discussion 

In this paper we have shown how the formalism of dynamical replica theory (see e.g. [13]) can be
successfully employed to construct a general theory which enables one to predict the evolution of
the relevant macroscopic performance measures for supervised (on-line and batch) learning in layered
neural networks, with randomly chosen but restricted training sets, i.e. for finite $\alpha = p/N$ where
weight updates are carried out by sampling with repetition. In this case the student node's local
fields are no longer described by (multivariate) Gaussian distributions and the traditional and familiar
statistical mechanical formalism consequently breaks down. For simplicity and transparency we have
restricted ourselves to single-layer systems and realizable tasks.

In our approach the joint field distribution $P[x,y]$ for the student and teacher local fields is itself
taken to be a dynamical order parameter, in addition to the conventional observables $Q$ and $R$
representing overlaps between the student-student and student-teacher vectors respectively. The new
order parameter set $\{Q,R,P\}$, in turn, enables one to monitor the generalization error $E_g$ as well as
the training error $E_t$. This then results, following the prescriptions of dynamical replica theory^, in a
diffusion equation for $P[x,y]$, which we have evaluated by making the replica-symmetric ansatz in the
saddle-point equations. This diffusion equation is generally found to have Gaussian solutions only for
$\alpha \to \infty$; in the latter case we indeed recover correctly from our theory the more familiar formalism
of infinite training sets (in the $N \to \infty$ limit), providing closed equations for $Q$ and $R$ only. For finite
$\alpha$ our theory is by construction exact if for $N \to \infty$ the dynamical order parameters $\{Q,R,P\}$ obey
closed deterministic equations, which are self-averaging (i.e. independent of the microscopic
realization of the training set). If this is not the case, our theory can be interpreted as employing a maximum
entropy approximation. In a sequel paper [17] we will work out our equations explicitly for various
choices of learning rules, and compare our theoretical predictions both to exact solutions, derived for
special cases directly from the microscopic equations*, and with numerical simulations. We will also
construct a number of simple but effective approximations to our full equations. As it will turn out,
our theory describes the various learning processes examined highly accurately.

The present study represents only a first step in understanding on-line learning with restricted
training sets. It opens up many extensions, applications and generalizations that can be carried out
(some of which are already under way). Firstly, our theory would simplify significantly if one could
find a more explicit solution of the functional saddle-point equation (96), enabling us to express the
function $\Phi[x,y]$ directly in terms of our order parameters. The benefits of such a solution will become
even greater as we apply our theory to more sophisticated learning rules, such as to perceptron or
AdaTron learning, or to learning in multi-layer networks (which run the risk of requiring a serious
amount of CPU time). Secondly, this theory opens up new possibilities for considering unrealizable
learning scenarios, either due to structural limitations or due to noise, which require some sort of

^The reason why the replica formalism is inevitable (unless we are willing to pay the price of having observables with
two time arguments, and turn to path integrals) is the necessity, for finite $\alpha$, to average the macroscopic equations over
all possible realizations of the training set.

*Such exact results can only be obtained for Hebbian-type rules, where the dependence of the updates $\Delta J(t)$ on the
weights $J(t)$ is trivial or even absent (a decay term at most), whereas our present theory generates macroscopic equations
for arbitrary learning rules.
regularization. The examination of regularization techniques in such scenarios, which is of great
practical significance, was out of reach so far as they come into effect only where the error-surface is 
fixed by having a fixed example set. Thirdly, at a more fundamental level one could explore the effects 
of (dynamic) replica symmetry breaking (by calculating the AT-surface, signalling instability of the 
replica symmetric solution with respect to replicon fluctuations), or one could improve the built-in 
accuracy of our theory by adding new observables to the present set (such as the Green's function 
A[x, y; x' ,y'] itself). Finally it would be interesting to see the connection between the present formalism 
and a suitable adaptation of the generating functional methods, as applied in |1C] to networks with 
binary weights, to the learning processes studied in this paper. 
A Replica Calculation of the Green's Function



The main objective of this Appendix is to calculate the Green's function $A[\ldots]$, with which we obtain
our macroscopic dynamic equations in explicit form. We first carry out the disorder averages, leading
to an effective single-spin problem. The integrations are done by steepest descent, giving a
saddle-point problem for replicated order parameters at each time step. In the saddle-point equations we
then make the replica symmetry (RS) ansatz, so that the limit $n \to 0$ can be taken. In addition we
show that the two functions $B[\ldots]$ and $C[\ldots]$ do indeed vanish, as claimed.



A.1 Disorder Averaging



The fundamental quantities A[x, y; x', y'] , B[x, y; x', y'] , C[x, y; x', y'; x", y"] , and P[x, y] , which control 
the macroscopic equations can be written as 

P[x,y] 
A[x,y; x',y'] 
B[x,y;x',y'] 
C[x,y;x',y';x",y"] 



lim 

N^oo 



l[\6\N-{a-f 



NR 



Tcr 



da" Y[ 5 



N 



X — - 



N 



L/ v^cri.^'l 


5 


y'- 











6 ^^'6 



^ Vn 



y Vn 



ID ID ID IS 



We next use the definition of $P[x,y;J]$, introduce integral representations for the $\delta$-distributions
involving $P[x,y]$, and obtain



P[x,y] 
A[x,y;x',y'] 
B[x,y;x',y'] 
C[x,y;x',y';x",y"] 



>= lim Y\{s\N-(a 

N^OoJ a [ ^ 



a\2 



NR 



Tcr 



da'' 



/ >./// ^ni 



N 



JN-TTa [Xa,ya]PlXa,ya] 



d-n{Xa,ya) 
X <



X 



X — 



Vn 



y 



Vn 



Vn 



X — 



ID D D /S 



The summations involving $(x_\alpha, y_\alpha)$ automatically lead to integrals, which can be performed due to
the $\delta$-distributions involved. We define new conjugate functions $\hat{P}_\alpha[x,y]$ via

X] [^a ,ya]f[Xa, Va] ^ j dx" dy" Pa [x" , y"] f [x" , y"] 

We write averages over the training set explicitly in terms of the $p = \alpha N$ constituent vectors $\{\xi^\mu\}$.

Finally we introduce integral representations for the remaining $\delta$-distributions, and obtain the
following expressions (at this stage we will have to separate the various structurally different cases):



P[x,y]= (^S^-+yy\ lim 



{27T) 

II {5 iV-(0' 



N^oo 
NR 



VQ 



-T ■ a 



d^a^iNjdx'W P.lx",y'W,y"] -Q dPa[x",y"]\ - 

x"y" I P n=l 



A[x,y;x' ,y'\ I ja^a/ jz-.^^v 



B[x,y;x',y'[ 



r dxdx'dydy' il^i+,,'£'+yy+yy'] ^.^ 

J (27r)4 N-.^ 



(68) 



N-^oo 



NR 



-T ■ a 



d„a^iNjd."dy" P.W',y"]PW',y"] TT dPa{x\y")\^ V 



IJ.^u=l 



J_ el^ ev tl^ tv 

N '^i 'ij 



(69) 

C[x, y; x', y'; x" , y"] = [ ^^^^'^^''^^^i^'#'' ^ife£+x'x'+x"ai"+j;i/+j;'i/'+it/"j/"1 



/n 



NR 

7Q 



— T ■ a 



d^a^^NJd."dy"P.[^",y"]P[^",y"] dPa{x" , y")\ 5,-5., 

x"y" J P fivp=l 



i E„ Ex Pc.[ ^%'^\ ^]-i[^VQCT'-i'+yr-e+i'VQcT^-C+y'T-C+^''VQ^^ 

(70) 



28 



The averages over the training sets $\langle \ldots \rangle_\Xi$ in (68, 69, 70) will now be done separately. First we define
some relevant objects:

V[u, V] = ^e-^ ^.(^^-^)-[«VQa^.^+.r.^]/vW^ ^ ^^^^ 

£ij[u,v] = (N£,i^j e 'v^^ ^ '^"^ \ {^ 7^ j) (73) 

As we will see, all are of order $O(N^0)$ as $N \to \infty$. We next use the permutation invariance of our
integrations and summations with respect to pattern labels. First we calculate the first training set
average occurring in (68):



(74) 



log ©[0,0] '^[^^y] 



V[0,0] 

The prefactor $e^{p \log \mathcal{D}[0,0]}$ will turn out to take care of appropriate normalisation, and will drop out
of the final result for all four functions $P[x,y]$, $A[x,y;x',y']$, $B[x,y;x',y']$ and $C[x,y;x',y';x'',y'']$.
Secondly we evaluate the training set average of the expression for $A[\ldots]$ in (69):



P 

J 



(75) 



(provided we indeed show that $\mathcal{L}_j[u,v] = O(N^0)$ as $N \to \infty$). Thirdly, the training set average of
the expression for $B[\ldots]$ in (69) is given by:



(^Em;eje---\ = ^E((e.^el)(4^e|) e-\ 
(provided we indeed show that $\mathcal{L}_{ij}[u,v] = O(N^0)$ as $N \to \infty$). Finally we also obtain for the training
set average in (70), in a similar fashion:



1 / 1 



p=l n,u^p / ^ F 



+ V[x\ y'%[x, y]£,[x\ y'].0{N-^) 

i 



(77) 



We now work out (72) and we show that it is of order $O(N^0)$. This is achieved by separating in the
exponent the terms with site label $i = j$ from those with site labels $i \neq j$, followed by expansion in
powers of the (relatively small) $i = j$ terms, and will involve the following two functions:



N x/N 



(78) 



(79) 



Note that there is no need to calculate the auxiliary functions (73); we only need to verify their
magnitude to scale as $O(N^0)$ for $N \to \infty$.



X 
+ u^/Qa] + vTj+0{N-^2[



so that 



fjKz;] = -iu^/Qa]V[u,v] - ivTjV[u,v] - - ^ f^"-^" K - -TjY,T^[u,v] + ©(TV"^) 



a 



-Ti[u,v] + udai'Dlujv] 



a 



+ 0{N-2) (80) 



Repetition/extension of this argument, by separating in the exponent terms with two special indices
rather than one, and by subsequent expansion (whereby each term brings down a factor $N^{-1/2}$),
immediately shows that terms of the form $\langle N \xi_i \xi_j\, e^{\ldots} \rangle_\xi$ with $i \neq j$ will be of order $O(N^0)$. This
confirms that $\mathcal{L}_{ij}[u,v] = O(N^0)$ and that (76) indeed scales as indicated. Note that the relevant
combination of intensive terms in (75) can be abbreviated as $\mathcal{L}[u,v;u',v'] = \frac{1}{N} \sum_j \mathcal{L}_j[u,v]\, \mathcal{L}_j[u',v']$.



C[u,v;u',v'] = -Q^ga/3({o-}) 

a/3 
1 



—J^? [u,v] + uSalD [u, v] 

a 



a 



—Tf [u, v] + u5aiD[u, v] 



E 

0/3 



a/3 

-^E 

a/3 

a 



+ 0{N- 



(81) 



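where we have used the built-in properties $N^{-1}\, \tau \cdot \sigma^\alpha = R/\sqrt{Q}$ and $N^{-1} \tau^2 = 1$, and in which we find the
spin-glass order parameters

$$q_{\alpha\beta}(\{\sigma\}) = \frac{1}{N} \sum_i \sigma_i^\alpha \sigma_i^\beta \qquad (82)$$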
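The replica overlaps that appear as spin-glass order parameters are normalised inner products between replica vectors that each satisfy the spherical constraint σ² = N. The sketch below is a small illustration with hypothetical helper names and toy vectors; it is not part of the calculation in this Appendix, where the overlaps are integration variables fixed by a saddle point rather than measured from data.

```python
import math


def to_sphere(v):
    """Rescale a vector so that its squared length equals its dimension N."""
    n = len(v)
    norm2 = sum(x * x for x in v)
    if norm2 == 0.0:
        raise ValueError("cannot normalise the zero vector")
    scale = math.sqrt(n / norm2)
    return [scale * x for x in v]


def overlap_matrix(sigmas):
    """q_ab = (1/N) * sum_i sigma_i^a * sigma_i^b for a list of replica vectors."""
    n_sites = len(sigmas[0])
    return [
        [sum(x * y for x, y in zip(sa, sb)) / n_sites for sb in sigmas]
        for sa in sigmas
    ]


# Hypothetical toy replicas with N = 4 sites.
replicas = [to_sphere(v) for v in ([1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0], [2.0, 0.0, 2.0, 0.0])]
for row in overlap_matrix(replicas):
    print(row)
```

With the spherical constraint imposed, the diagonal entries equal one and the off-diagonal entries lie between -1 and 1, which is why the replica-symmetric ansatz later needs only a single off-diagonal value q.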



Let us finally work out further the remaining fundamental objects $\mathcal{D}[\ldots]$ and $\mathcal{F}_{1,2}[\ldots]$. The basic
property to be used is that for large $N$ the $n+1$ quantities $\{x_\alpha = \sigma^\alpha \cdot \xi/\sqrt{N},\ y = \tau \cdot \xi/\sqrt{N}\}$
inside averages of the form $\langle \ldots \rangle_\xi$ will become (zero average but correlated) Gaussian variables, with
probability distribution



P(xi, 
This allows us to write



'D\u,v] 



det2 A 
(27r)("+i)/2 



■A 



dxdy e 



V y J 



det2 A 
(27r)("+i)/2 



X-n 

\y J 
V y J 



■A 



V y / 



(83) 
(84) 



Note that these quantities depend on the microscopic variables $\sigma^\alpha$ only through the macroscopic
observables $q_{\alpha\beta}(\{\sigma\})$.

A.2 Derivation of Saddle-Point Equations

We will now combine the results (74, 75, 76, 77) and (81) with the expressions (68, 69, 70). We use integral
representations for the remaining delta functions, and isolate the observables $q_{\alpha\beta}$, by inserting

dqdqdQdk ^iN[Y,JQc.+KR/^)+Y.^^q^pq^p]-iY.,Y.^Q^i''?)''+^^^ 
(27r)"^+2™ 

We hereby achieve a full factorisation over sites in the relevant quantities (note: the objects $\mathcal{D}[\ldots]$
and $\mathcal{L}[\ldots]$ depend on the microscopic variables only via $q_{\alpha\beta}(\{\sigma\})$):



dxdx'dydy' 



,i[xx+x' x' +yy+yy 



yy'U\m. lim (dqdqdQdk FT dPo,{x" ,y") 
N^oo J 



C[x,y;x',y' 
p2[o,0] 



and 



P[x,y] 



dxdy 



j-j^ lim [dqdqdQdk TT dPa{x",y") 



icr e 



P[0,0] 

Both can be written in the form of an integral dominated by saddle-points: 

'dxdx'dydy' ,r„j.i^'^' 



A[x,y;x' ,y'\ 



^i[xx+x X +yy+yy 
and



with 



m...]=i 



ax"y" 



P[0,0] 



J2{Qa + RaR/VQ) + i E 9a/39a/3 + ^ E / dy" Pa{x" , y")P[x" ,y"] 

+alogP[0,0]+ lim ^Vlog Ltr e-'^-t^°"'+^'^"'"'^l-*S-/3^'^''"""'' 

N^oo N J 

Finally we use the fact that the above expressions will be given by the intensive parts evaluated in the
dominating saddle-point of $\Psi$. We can use the expression for $P[x,y]$ and its property $\int dx\, dy\; P[x,y] = 1$
to verify that all expressions are properly normalised (no overall prefactors are to be taken into
account). We perform a simple transformation on some of our integration variables:



^o/3 Qaf3 — Qa^af} 



Ra —>■ \/QRc 



and finally we get 



P[x,y] 



dxdy j[^^+y^^ V[x,y] 



(85) 



(86) 



(27r)2 n^OV[0,0] 

in which all functions are to be evaluated upon choosing for the order parameters the appropriate
saddle-points of $\Psi$ (variation with respect to $q$, $\hat{q}$, $\hat{Q}$, $\hat{R}$ and $\{\hat{P}\}$), which itself takes the form:



^^Qai^-qaa) + iR'^Ra + i'^QalSqalB + / dx" dy" Pa{x" ,y")P[x" ,y"\ 

a a al3 « 

+ alogP[0,0]+ lim ^Vlog f da e~'^"^^-^'"'''-'^-0^'^^'""'^ 



(87) 



with $\mathcal{D}[\ldots]$ given by (83), which depends on the variational parameters $\{\hat{P}\}$ and $q$ only. The
function $\mathcal{L}[\ldots]$ is given by (81). The order parameters $q_{\alpha\beta}$ have the usual interpretation in terms of
the average probability density for finding a mutual overlap $q$ of two independently evolving weight
vectors $(J^a, J^b)$, in two systems $a$ and $b$ with the same realization of the training set (see e.g. 0]):



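\[ P(q) = \left\langle \delta\!\left[q - \frac{J^a\cdot J^b}{|J^a|\,|J^b|}\right]\right\rangle = \lim_{n\to0}\frac{1}{n(n-1)}\sum_{\alpha\neq\beta}\delta[q - q_{\alpha\beta}] \]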



Note that upon applying the above procedure to the functions . .] and C[. . .] in (^£0) we find again 
integrals dominated by the dominant saddle-point of ^; here, in view of (|7^ and (7^, the intensive 
parts are zero, and thus 

\[ B[x,y;x',y'] = C[x,y;x',y';x'',y''] = 0 \]

as anticipated earlier. 
A.3 Replica-Symmetric Saddle-Points

We now make the replica symmetric (RS) ansatz in the extremisation problem, which according to 



is equivalent to assuming ergodicity. With a modest amount of foresight we put 

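\[ q_{\alpha\beta} = q_0\delta_{\alpha\beta} + q[1-\delta_{\alpha\beta}], \qquad \hat q_{\alpha\beta} = \tfrac{i}{2}[r - r_0\delta_{\alpha\beta}], \qquad \hat R_\alpha = i\rho, \qquad \hat Q_\alpha = i\phi, \qquad \hat P_\alpha[u,v] = i\chi[u,v] \]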



This converts the quantity $\Psi$ of equation (87) for small n into
with the abbreviation $Dz = (2\pi)^{-1/2}e^{-\frac12 z^2}dz$. We do the Gaussian integral in the last term, and expand the result for small n:

(90)



Note that 'const' refers to terms which do not depend on the order parameters to be varied, and will thus not show up in saddle-point equations; such terms can, however, depend on time via quantities such as (Q, R). At this stage it is useful to work out four of our saddle-point equations:

\[ \frac{\partial\Psi}{\partial\phi} = \frac{\partial\Psi}{\partial r} = \frac{\partial\Psi}{\partial\rho} = \frac{\partial\Psi}{\partial r_0} = 0 \]
$q_0 = 1,$

These allow us to eliminate most variational parameters, leaving a saddle-point problem involving only the function $\chi[x,y]$ and the scalar $q$:



hm g, {x}\ = 7 + - log(l- 



a 



Finally we have to work out the RS version of $\mathcal{D}[u,v;q,\{\chi\}]$:
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dxdy e 
I 1 











■A 










V y ) 




\y J 



+ a x{VQxa,y)-i[u'jQxi+vy\ 



(92) 



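\[ \begin{pmatrix} 1 & q & \cdots & q & R/\sqrt{Q} \\ q & 1 & \cdots & q & R/\sqrt{Q} \\ \vdots & & \ddots & \vdots & \vdots \\ q & \cdots & q & 1 & R/\sqrt{Q} \\ R/\sqrt{Q} & \cdots & \cdots & R/\sqrt{Q} & 1 \end{pmatrix} \]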
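The inverse of the above matrix is found to be

\[ \begin{pmatrix} C_{11} & \cdots & C_{1n} & \gamma \\ \vdots & & \vdots & \vdots \\ C_{n1} & \cdots & C_{nn} & \gamma \\ \gamma & \cdots & \gamma & b \end{pmatrix}, \qquad C_{\alpha\beta} = \frac{\delta_{\alpha\beta}}{1-q} - d, \qquad b = 1 + O(n) \]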
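The structure of this inverse can be checked numerically. The following is a minimal sketch, assuming the matrix has 1 on the diagonal, $q$ between distinct replicas and $r = R/\sqrt{Q}$ in the last row and column. The finite-$n$ expressions for $d$, $\gamma$ and $b$ used below are not taken from the text: they are obtained here from a Schur complement combined with the Sherman-Morrison formula, and all function names are illustrative.

```python
import numpy as np

def covariance(n, q, r):
    """(n+1)x(n+1) matrix: 1 on the diagonal, q between replicas,
    r = R/sqrt(Q) in the last row and column (assumed structure)."""
    m = np.full((n + 1, n + 1), q, dtype=float)
    m[-1, :] = r
    m[:, -1] = r
    np.fill_diagonal(m, 1.0)
    return m

def inverse_entries(n, q, r):
    """Finite-n values of d, gamma and b (derived here, not quoted from the paper)."""
    s = 1.0 - q + n * (q - r**2)
    d = (q - r**2) / ((1.0 - q) * s)
    gamma = -r / s
    b = 1.0 + n * r**2 / s
    return d, gamma, b

def predicted_inverse(n, q, r):
    """Assemble the inverse from C_ab = delta_ab/(1-q) - d, gamma and b."""
    d, gamma, b = inverse_entries(n, q, r)
    inv = np.empty((n + 1, n + 1))
    inv[:n, :n] = np.eye(n) / (1.0 - q) - d
    inv[:n, -1] = gamma
    inv[-1, :n] = gamma
    inv[-1, -1] = b
    return inv

n, q, r = 4, 0.5, 0.3
print(np.allclose(np.linalg.inv(covariance(n, q, r)), predicted_inverse(n, q, r)))
```

Under these assumptions the limit $n \to 0$ gives $d \to (q - R^2/Q)/(1-q)^2$, $\gamma \to -R/[\sqrt{Q}(1-q)]$ and $b \to 1$, in line with $b = 1 + O(n)$.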



With this expression, and upon linearising the terms in the exponents which are quadratic in x in the 
usual manner with Gaussian integrals, we obtain 



Jdx e 2(1-9) ^' " ' s/b' 



jDzDy 

For the saddle-point problem we only need to calculate $\lim_{n\to0}\frac{1}{n}\log\mathcal{D}[0,0;q,\{\chi\}]$:

lim - logP[0, 0; q, {x}] = lim - (log / DzDy \ [ dx g" ^{f^+l^^-^J'/^l^+i^lv^^'^/^l 



(93) 



log / DzDy 



a / DzDy log 



/dx e 2Q(i-9) 



hx [2\/d-7j/] / ix[a:,!/] 



with $\gamma$ and $d$ evaluated in the limit $n \to 0$. Equivalently we can define

\[ A = \frac{R}{Q(1-q)}, \qquad B = \frac{\sqrt{qQ-R^2}}{Q(1-q)} \]

which gives



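\[ \lim_{n\to0}\frac{1}{n}\log\mathcal{D}[0,0;q,\{\chi\}] = \alpha\int\! DzDy\,\log\left\{\frac{\int\! dx\; e^{-\frac{x^2}{2Q(1-q)} + x[Ay+Bz] + \frac{1}{\alpha}\chi[x,y]}}{\int\! dx\; e^{-\frac{x^2}{2Q(1-q)} + x[Ay+Bz]}}\right\} \qquad (94) \]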
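The origin of the short-hands $A$ and $B$ can be made explicit. What follows is a sketch rather than the authors' own derivation. It assumes the $n \to 0$ values $d = (q - R^2/Q)/(1-q)^2$ and $\gamma = -R/[\sqrt{Q}(1-q)]$ for the entries of the inverse matrix (obtained by direct inversion, not quoted from the text), and it rescales the integration variable as $x_\alpha = x/\sqrt{Q}$:

```latex
\begin{align*}
e^{\frac{d}{2}\left(\sum_\alpha x_\alpha\right)^2}
  &= \int\! Dz\; e^{z\sqrt{d}\,\sum_\alpha x_\alpha}
  && \text{(Gaussian linearisation)}\\
-\tfrac12\sum_{\alpha\beta}C_{\alpha\beta}x_\alpha x_\beta - \gamma y\sum_\alpha x_\alpha
  &\;\longrightarrow\; \sum_\alpha\Big[-\frac{x_\alpha^2}{2(1-q)} + x_\alpha\big(z\sqrt{d}-\gamma y\big)\Big]
  && \text{(inside } \textstyle\int\! Dz\text{)}\\
-\frac{x^2}{2Q(1-q)} + x\Big[-\frac{\gamma}{\sqrt{Q}}\,y + \sqrt{\frac{d}{Q}}\,z\Big]
  &= -\frac{x^2}{2Q(1-q)} + x\,[Ay+Bz]
  && (x_\alpha = x/\sqrt{Q})\\
A = -\frac{\gamma}{\sqrt{Q}} = \frac{R}{Q(1-q)},
  &\qquad
B = \sqrt{\frac{d}{Q}} = \frac{\sqrt{qQ-R^2}}{Q(1-q)}
\end{align*}
```

The single-replica exponent thus takes the form $-x^2/2Q(1-q) + x[Ay+Bz]$, which is the combination appearing in the integrals over $x$.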



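Upon doing the x-integration in the denominator of this expression we can write the explicit expression for the surface $\Psi$ to be extremised with respect to $q$ and the function $\chi[x,y]$, apart from irrelevant constants, in the surprisingly simple form (with the short-hands $A$ and $B$ defined above):

\[ \lim_{n\to0}\frac{1}{n}\Psi[q,\{\chi\}] = \frac{1-\alpha-R^2/Q}{2(1-q)} + \frac12(1-\alpha)\log(1-q) - \int\! dx\,dy\;\chi[x,y]P[x,y] + \alpha\int\! DzDy\,\log\int\! dx\; e^{-\frac{x^2}{2Q(1-q)} + x[Ay+Bz] + \frac{1}{\alpha}\chi[x,y]} \qquad (95) \]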






Note that (95) is to be minimised, both with respect to $q$ (which originated as an $n(n-1)$-fold entry in a matrix, leading to curvature sign change for n < 1) and with respect to the function $\chi[x,y]$ (obtained from the n-fold occurrence of the original function $\hat P$, multiplied by $i$, which also leads to curvature sign change).

The remaining saddle-point equations are obtained by variation of (95) with respect to $\chi$ and $q$. Functional variation with respect to $\chi$ gives:



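\[ \text{for all } x, y: \qquad P[x,y] = \frac{e^{-\frac12 y^2}}{\sqrt{2\pi}}\int\! Dz\; \frac{e^{-\frac{x^2}{2Q(1-q)} + x[Ay+Bz] + \frac{1}{\alpha}\chi[x,y]}}{\int\! dx'\; e^{-\frac{x'^2}{2Q(1-q)} + x'[Ay+Bz] + \frac{1}{\alpha}\chi[x',y]}} \qquad (96) \]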



Note that $P[x,y] = P[x|y]P[y]$ with $P[y] = (2\pi)^{-1/2}e^{-\frac12 y^2}$, as could have been expected. Next we vary $q$, and use (96) wherever possible:



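\[ \frac{R^2/Q - q(1-\alpha)}{2(1-q)^2} = \alpha\int\! DzDy\; \frac{\int\! dx\; e^{-\frac{x^2}{2Q(1-q)} + x[Ay+Bz] + \frac{1}{\alpha}\chi[x,y]}\;\frac{\partial}{\partial q}\Big[-\frac{x^2}{2Q(1-q)} + x(Ay+Bz)\Big]}{\int\! dx\; e^{-\frac{x^2}{2Q(1-q)} + x[Ay+Bz] + \frac{1}{\alpha}\chi[x,y]}} \]

giving

\[ \int\! dx\,dy\; P[x,y](x-Ry)^2 + (qQ-R^2)\Big(1-\frac{1}{\alpha}\Big) = \left[2\sqrt{qQ-R^2} + \frac{Q(1-q)}{\sqrt{qQ-R^2}}\right]\int\! DzDy\; z\, \frac{\int\! dx\; x\, e^{-\frac{x^2}{2Q(1-q)} + x[Ay+Bz] + \frac{1}{\alpha}\chi[x,y]}}{\int\! dx\; e^{-\frac{x^2}{2Q(1-q)} + x[Ay+Bz] + \frac{1}{\alpha}\chi[x,y]}} \qquad (97) \]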
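The variation with respect to $q$ can be spelled out step by step. The following is a sketch, assuming that the surface (95) consists of the terms $\frac{1-\alpha-R^2/Q}{2(1-q)} + \frac12(1-\alpha)\log(1-q) - \int dx\,dy\,\chi P + \alpha\int DzDy\,\log\int dx\,e^{E}$ with $E = -\frac{x^2}{2Q(1-q)} + x[Ay+Bz] + \frac{1}{\alpha}\chi[x,y]$. The bracket $\langle g\rangle = \int dx\,e^{E}g / \int dx\,e^{E}$ is a short-hand introduced here only for this check.

```latex
\begin{align*}
\partial_q A &= \frac{R}{Q(1-q)^2},
\qquad
\partial_q B = \frac{1}{2(1-q)\sqrt{qQ-R^2}} + \frac{\sqrt{qQ-R^2}}{Q(1-q)^2},\\[4pt]
0 &= \frac{q(1-\alpha)-R^2/Q}{2(1-q)^2}
   + \alpha\!\int\! DzDy\,\Big\langle -\frac{x^2}{2Q(1-q)^2} + xy\,\partial_q A + xz\,\partial_q B\Big\rangle .
\intertext{Multiplying by $2Q(1-q)^2/\alpha$, and using (96) in the form
$\int\! DzDy\,\langle x^2\rangle = \int\! dx\,dy\,P\,x^2$ and
$\int\! DzDy\,\langle xy\rangle = \int\! dx\,dy\,P\,xy$:}
\int\! dx\,dy\,P[x,y]\,(x^2 - 2Rxy)
  &= \frac{qQ(1-\alpha)-R^2}{\alpha}
   + \Big[2\sqrt{qQ-R^2} + \frac{Q(1-q)}{\sqrt{qQ-R^2}}\Big]\int\! DzDy\; z\,\langle x\rangle .
\intertext{Adding $R^2\!\int\! dx\,dy\,P\,y^2 = R^2$ to both sides completes the square:}
\int\! dx\,dy\,P[x,y]\,(x-Ry)^2 + (qQ-R^2)\Big(1-\frac{1}{\alpha}\Big)
  &= \Big[2\sqrt{qQ-R^2} + \frac{Q(1-q)}{\sqrt{qQ-R^2}}\Big]\int\! DzDy\; z\,\langle x\rangle .
\end{align*}
```

The second term in the square bracket equals $1/B$, so the prefactor can also be read as $2\sqrt{qQ-R^2} + 1/B$.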



A.4 Explicit Expression for the Green's Function



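In order to work out the Green's function (85) we need the function $C[u,v;u',v']$ as defined in (81) which, in turn, is given in terms of the integrals (83, 84). First we calculate the $n \to 0$ limit of $\mathcal{D}[u,v;q,\{\chi\}]$ (92), and simplify the result with the saddle-point equation (96):

\[ \lim_{n\to0}\mathcal{D}[u,v;q,\{\chi\}] = \int\! DzDy\; e^{-ivy}\, \frac{\int\! dx\; e^{-\frac{x^2}{2Q(1-q)} + x[Ay+Bz] + \frac{1}{\alpha}\chi[x,y] - iux}}{\int\! dx\; e^{-\frac{x^2}{2Q(1-q)} + x[Ay+Bz] + \frac{1}{\alpha}\chi[x,y]}} = \int\! dx\,dy\; P[x,y]\,e^{-ivy-iux} \qquad (98) \]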
Next we work out the quantities $\mathcal{F}^\alpha_{1,2}[u,v]$ of equation (84) in RS ansatz, using Gaussian linearizations:



lim jr"2[^t, v] = i lim 

n— >0 ' 



Jdxdy di^2X[VQxa,y] e 











■A 










\ y ) 




\ y 1 



Ha x[VQ^a,y\-i[u'/Qxi+vy\ 



J dxdy e 











1 

2 


Xn 

\ y J 


A 


Xn 

\ y ) 
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i lim / DyDz e'^^y / dx e 



^2 



d\x[vQxa,y] 



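The replica permutation symmetries of this expression allow us to conclude

\[ \lim_{n\to0}\mathcal{F}^\alpha_{1,2}[u,v] = \delta_{\alpha1}\,\mathcal{F}^1_{1,2}[u,v] + (1-\delta_{\alpha1})\,\mathcal{F}^2_{1,2}[u,v] \qquad (99) \]

where

\[ \mathcal{F}^1_{1,2}[u,v] = i\int\! dx\,dy\; P[x,y]\,e^{-iux-ivy}\,\partial_{1,2}\chi[x,y] \qquad (100) \]

\[ \mathcal{F}^2_{1,2}[u,v] = i\int\! DyDz\; e^{-ivy}\, \frac{\int\! dx\; e^{-\frac{x^2}{2Q(1-q)} + x[Ay+Bz] + \frac{1}{\alpha}\chi[x,y] - iux}}{\int\! dx\; e^{-\frac{x^2}{2Q(1-q)} + x[Ay+Bz] + \frac{1}{\alpha}\chi[x,y]}}\; \frac{\int\! dx\; e^{-\frac{x^2}{2Q(1-q)} + x[Ay+Bz] + \frac{1}{\alpha}\chi[x,y]}\,\partial_{1,2}\chi[x,y]}{\int\! dx\; e^{-\frac{x^2}{2Q(1-q)} + x[Ay+Bz] + \frac{1}{\alpha}\chi[x,y]}} \qquad (101) \]

We can now proceed to the calculation of (81). First we note that the basic building blocks of $C[\ldots]$ are most easily expressed in terms of the functions

\[ G_1[u,v] = \frac{1}{\alpha}\mathcal{F}^1_1[u,v] + u\,\mathcal{D}[u,v], \qquad \tilde G_1[u,v] = \frac{1}{\alpha}\mathcal{F}^2_1[u,v] \qquad (102) \]

\[ G_2[u,v] = \frac{1}{\alpha}\mathcal{F}^1_2[u,v] + v\,\mathcal{D}[u,v], \qquad \tilde G_2[u,v] = \frac{1}{\alpha}\mathcal{F}^2_2[u,v] \qquad (103) \]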

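With these short-hands we obtain, upon performing the summations over replica indices in (81):

\begin{align*}
C[u,v;u',v'] = {} & -Q(1-q)\,G_1[u,v]G_1[u',v'] - Q(1-q)(n-1)\,\tilde G_1[u,v]\tilde G_1[u',v'] \\
& -Qq\,\big[G_1[u,v] + (n-1)\tilde G_1[u,v]\big]\big[G_1[u',v'] + (n-1)\tilde G_1[u',v']\big] \\
& -R\,\big[G_1[u,v] + (n-1)\tilde G_1[u,v]\big]\big[G_2[u',v'] + (n-1)\tilde G_2[u',v']\big] \\
& -R\,\big[G_1[u',v'] + (n-1)\tilde G_1[u',v']\big]\big[G_2[u,v] + (n-1)\tilde G_2[u,v]\big] \\
& -\big[G_2[u,v] + (n-1)\tilde G_2[u,v]\big]\big[G_2[u',v'] + (n-1)\tilde G_2[u',v']\big]
\end{align*}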



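and so

\begin{align*}
\lim_{n\to0} C[u,v;u',v'] = {} & -Q(1-q)\big[G_1[u,v]G_1[u',v'] - \tilde G_1[u,v]\tilde G_1[u',v']\big] \\
& -Qq\,\big[G_1[u,v] - \tilde G_1[u,v]\big]\big[G_1[u',v'] - \tilde G_1[u',v']\big] \\
& -R\,\big[G_1[u,v] - \tilde G_1[u,v]\big]\big[G_2[u',v'] - \tilde G_2[u',v']\big] - R\,\big[G_1[u',v'] - \tilde G_1[u',v']\big]\big[G_2[u,v] - \tilde G_2[u,v]\big] \\
& -\big[G_2[u,v] - \tilde G_2[u,v]\big]\big[G_2[u',v'] - \tilde G_2[u',v']\big]
\end{align*}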
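With the Fourier transforms of the functions G[. . .], given by

\[ \hat G_1[\hat u,\hat v] = \int\!\frac{du\,dv}{(2\pi)^2}\; e^{iu\hat u + iv\hat v}\Big[\frac{1}{\alpha}\mathcal{F}^1_1[u,v] + u\,\mathcal{D}[u,v]\Big], \qquad \hat{\tilde G}_1[\hat u,\hat v] = \frac{1}{\alpha}\int\!\frac{du\,dv}{(2\pi)^2}\; e^{iu\hat u + iv\hat v}\,\mathcal{F}^2_1[u,v] \qquad (104) \]

\[ \hat G_2[\hat u,\hat v] = \int\!\frac{du\,dv}{(2\pi)^2}\; e^{iu\hat u + iv\hat v}\Big[\frac{1}{\alpha}\mathcal{F}^1_2[u,v] + v\,\mathcal{D}[u,v]\Big], \qquad \hat{\tilde G}_2[\hat u,\hat v] = \frac{1}{\alpha}\int\!\frac{du\,dv}{(2\pi)^2}\; e^{iu\hat u + iv\hat v}\,\mathcal{F}^2_2[u,v] \qquad (105) \]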
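the Green's function $\mathcal{A}[x,y;x',y']$ (85) can now be written in explicit form as

\begin{align*}
\mathcal{A}[x,y;x',y'] = {} & -Q(1-q)\big[\hat G_1[x,y]\hat G_1[x',y'] - \hat{\tilde G}_1[x,y]\hat{\tilde G}_1[x',y']\big] \\
& -Qq\,\big[\hat G_1[x,y] - \hat{\tilde G}_1[x,y]\big]\big[\hat G_1[x',y'] - \hat{\tilde G}_1[x',y']\big] \\
& -R\,\big[\hat G_1[x,y] - \hat{\tilde G}_1[x,y]\big]\big[\hat G_2[x',y'] - \hat{\tilde G}_2[x',y']\big] - R\,\big[\hat G_1[x',y'] - \hat{\tilde G}_1[x',y']\big]\big[\hat G_2[x,y] - \hat{\tilde G}_2[x,y]\big] \\
& -\big[\hat G_2[x,y] - \hat{\tilde G}_2[x,y]\big]\big[\hat G_2[x',y'] - \hat{\tilde G}_2[x',y']\big] \qquad (106)
\end{align*}

Finally, working out the four relevant Fourier transforms, using (98, 100, 101), gives:

\[ \hat G_1[x,y] = iP[x,y]\Big\{\frac{1}{\alpha}\frac{\partial}{\partial x}\chi[x,y] - \frac{\partial}{\partial x}\log P[x,y]\Big\} \qquad (107) \]

\[ \hat G_2[x,y] = iP[x,y]\Big\{\frac{1}{\alpha}\frac{\partial}{\partial y}\chi[x,y] - \frac{\partial}{\partial y}\log P[x,y]\Big\} \qquad (108) \]

\[ \hat{\tilde G}_1[x,y] = \frac{i}{\alpha}P[y]\int\! Dz\; \frac{\int\! dx'\; e^{-\frac{x'^2}{2Q(1-q)} + x'[Ay+Bz] + \frac{1}{\alpha}\chi[x',y]}\,\partial_{x'}\chi[x',y]}{\int\! dx'\; e^{-\frac{x'^2}{2Q(1-q)} + x'[Ay+Bz] + \frac{1}{\alpha}\chi[x',y]}}\; \frac{e^{-\frac{x^2}{2Q(1-q)} + x[Ay+Bz] + \frac{1}{\alpha}\chi[x,y]}}{\int\! dx'\; e^{-\frac{x'^2}{2Q(1-q)} + x'[Ay+Bz] + \frac{1}{\alpha}\chi[x',y]}} \qquad (109) \]

\[ \hat{\tilde G}_2[x,y] = \frac{i}{\alpha}P[y]\int\! Dz\; \frac{\int\! dx'\; e^{-\frac{x'^2}{2Q(1-q)} + x'[Ay+Bz] + \frac{1}{\alpha}\chi[x',y]}\,\partial_{y}\chi[x',y]}{\int\! dx'\; e^{-\frac{x'^2}{2Q(1-q)} + x'[Ay+Bz] + \frac{1}{\alpha}\chi[x',y]}}\; \frac{e^{-\frac{x^2}{2Q(1-q)} + x[Ay+Bz] + \frac{1}{\alpha}\chi[x,y]}}{\int\! dx'\; e^{-\frac{x'^2}{2Q(1-q)} + x'[Ay+Bz] + \frac{1}{\alpha}\chi[x',y]}} \qquad (110) \]

with $P[y] = (2\pi)^{-1/2}e^{-\frac12 y^2}$.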
Since the distribution $P[x,y]$ obeys $P[x,y] = P[x|y]P[y]$ with $P[y] = (2\pi)^{-1/2}e^{-\frac12 y^2}$, our equations can be simplified by choosing as our order parameter function the conditional distribution $P[x|y]$. We also replace the conjugate order parameter function $\chi[x,y]$ by the effective measure $M[x,y]$, and we introduce a compact notation for the relevant averages in our problem:

\[ \langle f[x,y,z]\rangle_\star = \frac{\int\! dx\; M[x,y]\,e^{Bxz} f[x,y,z]}{\int\! dx\; M[x,y]\,e^{Bxz}} \qquad (111) \]



Instead of the original Green's function $\mathcal{A}[x,y;x',y']$ we turn to the transformed Green's function $\tilde{\mathcal{A}}[x,y;x',y']$, defined as

\[ \mathcal{A}[x,y;x',y'] = P[x,y]\,\tilde{\mathcal{A}}[x,y;x',y']\,P[x',y'] \]

With these notational conventions one finds that (106) translates into the following expression:

\begin{align*}
\tilde{\mathcal{A}}[x,y;x',y'] = {} & Q(1-q)\big[J_1[x,y]J_1[x',y'] - \tilde J_1[x,y]\tilde J_1[x',y']\big] + Qq\,\big[J_1[x,y] - \tilde J_1[x,y]\big]\big[J_1[x',y'] - \tilde J_1[x',y']\big] \\
& + R\,\big[J_1[x,y] - \tilde J_1[x,y]\big]J_2[x',y'] + R\,\big[J_1[x',y'] - \tilde J_1[x',y']\big]J_2[x,y] + J_2[x,y]J_2[x',y'] \qquad (112)
\end{align*}

with



dX " P[X\Y] Q{l-q) 






r 8 T—TIY 
J,\X,Y\ = P\X\Y\-^ jDz{—\oiM[x^'\ + -^-—.).{S\X-A). 

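It turns out that significant simplification of the result (112) is possible, upon using the following two identities to rewrite the functions $J_1[\ldots]$, $\tilde J_1[\ldots]$ and $J_2[\ldots]$:

\[ \Big\langle \frac{\partial}{\partial x}\log M[x,y]\Big\rangle_\star = -Bz \qquad (113) \]

\[ \Big\langle \frac{\partial}{\partial y}\log M[x,y]\Big\rangle_\star = \frac{\partial}{\partial y}\log\int\! dx\; e^{Bxz}M[x,y] \qquad (114) \]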



Identity (113) results upon integrating by parts with respect to $x$, whereas identity (114) is a direct consequence of $y$ dependencies occurring in $M[x,y]$ only. Note that $B = \sqrt{qQ-R^2}/Q(1-q)$. To achieve the desired simplification of $\tilde{\mathcal{A}}[x,y;x',y']$ we define the following object:



\[ \Phi[X,y] = \big\{Q(1-q)P[X|y]\big\}^{-1}\int\! Dz\; \langle X-x\rangle_\star\,\langle\delta[X-x]\rangle_\star \qquad (115) \]
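Identity (113) lends itself to a quick numerical check. The sketch below evaluates the average (111) on a uniform grid for a hypothetical effective measure $M[x,y]$, chosen purely for illustration (it is not the saddle-point solution), and compares $\langle \partial_x \log M\rangle_\star$ with $-Bz$. The only property used is that $M$ decays fast enough for the boundary terms in the integration by parts to vanish.

```python
import numpy as np

def star_average(f_vals, m_vals, x, B, z):
    """<f>_* = int dx M e^{Bxz} f / int dx M e^{Bxz} on a uniform grid.
    The grid spacing cancels between numerator and denominator."""
    w = m_vals * np.exp(B * x * z)
    return np.sum(w * f_vals) / np.sum(w)

# Hypothetical measure (illustration only): Gaussian decay times a smooth modulation.
def log_M(x, y):
    return -0.5 * x**2 / 0.7 + 0.4 * x * y + 0.3 * np.cos(x + y)

def dlogM_dx(x, y):
    # Analytic x-derivative of log_M above.
    return -x / 0.7 + 0.4 * y - 0.3 * np.sin(x + y)

x = np.linspace(-30.0, 30.0, 200001)
B, y, z = 0.8, 0.5, -1.3
lhs = star_average(dlogM_dx(x, y), np.exp(log_M(x, y)), x, B, z)
print(lhs, -B * z)  # the two numbers should agree to high accuracy
```

The same routine can be reused for any other decaying choice of `log_M`, provided `dlogM_dx` is updated to match.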

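We can now, after additional integration by parts with respect to $z$, simplify the above expressions for $J_1[\ldots]$, $\tilde J_1[\ldots]$ and $J_2[\ldots]$ to

\[ J_1[X,Y] = \frac{X-RY}{Q(1-q)} - \frac{qQ-R^2}{Q(1-q)}\,\Phi[X,Y], \qquad \tilde J_1[X,Y] = J_1[X,Y] - \Phi[X,Y], \qquad J_2[X,Y] = Y - R\,\Phi[X,Y] \]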

and consequently

\[ \mathcal{A}[x,y;x',y'] = P[x,y]\,\tilde{\mathcal{A}}[x,y;x',y']\,P[x',y'] \qquad (116) \]

\[ \tilde{\mathcal{A}}[x,y;x',y'] = yy' + (x-Ry)\,\Phi[x',y'] + (x'-Ry')\,\Phi[x,y] - (Q-R^2)\,\Phi[x,y]\,\Phi[x',y'] \qquad (117) \]

with $\Phi[x,y]$ as given in (115).
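The algebra leading from (112) to (117) can be checked numerically by treating the values of $\Phi$ at the two argument pairs as free numbers. The sketch below assumes the simplified forms $J_1 = (x-Ry)/Q(1-q) - [(qQ-R^2)/Q(1-q)]\Phi$, $\tilde J_1 = J_1 - \Phi$ and $J_2 = y - R\Phi$; the first two are reconstructed from the definitions of $\Phi$, $A$ and $B$ rather than quoted from the text, and the function names are illustrative.

```python
import random

def green_112(x, y, xp, yp, phi, phip, Q, q, R):
    """Right-hand side of (112), with J1, J1~ and J2 expressed through Phi
    (assumed simplified forms, see the lead-in)."""
    c = (q * Q - R**2) / (Q * (1 - q))

    def J(x_, y_, phi_):
        j1 = (x_ - R * y_) / (Q * (1 - q)) - c * phi_
        return j1, j1 - phi_, y_ - R * phi_   # J1, J1~, J2

    j1, j1t, j2 = J(x, y, phi)
    k1, k1t, k2 = J(xp, yp, phip)
    return (Q * (1 - q) * (j1 * k1 - j1t * k1t)
            + Q * q * (j1 - j1t) * (k1 - k1t)
            + R * (j1 - j1t) * k2
            + R * (k1 - k1t) * j2
            + j2 * k2)

def green_117(x, y, xp, yp, phi, phip, Q, R):
    """Right-hand side of (117)."""
    return (y * yp + (x - R * y) * phip + (xp - R * yp) * phi
            - (Q - R**2) * phi * phip)

random.seed(0)
args = [random.uniform(-2.0, 2.0) for _ in range(6)]  # x, y, x', y', Phi, Phi'
Q, q, R = 1.3, 0.4, 0.5
print(green_112(*args, Q, q, R), green_117(*args, Q, R))
```

Both expressions are symmetric under the exchange of the primed and unprimed arguments, and the dependence on $q$ drops out of the final result, as (117) shows.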